Research Article

An Efficient MapReduce-Based Parallel Clustering Algorithm for Distributed Traffic Subarea Division

Dawen Xia,1,2 Binfeng Wang,1 Yantao Li,1 Zhuobo Rong,1 and Zili Zhang1,3

1 School of Computer and Information Science, Southwest University, Chongqing 400715, China
2 School of Information Engineering, Guizhou Minzu University, Guiyang 550025, China
3 School of Information Technology, Deakin University, Waurn Ponds, VIC 3216, Australia

Correspondence should be addressed to Zili Zhang; [email protected]

Received 21 April 2015; Revised 12 July 2015; Accepted 13 August 2015

Academic Editor: Hubertus Von Bremen

Copyright © 2015 Dawen Xia et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Hindawi Publishing Corporation, Discrete Dynamics in Nature and Society, Volume 2015, Article ID 793010, 18 pages, http://dx.doi.org/10.1155/2015/793010

Abstract. Traffic subarea division is vital for traffic system management and traffic network analysis in intelligent transportation systems (ITSs). Since existing methods may not be suitable for big traffic data processing, this paper presents a MapReduce-based Parallel Three-Phase K-Means (Par3PKM) algorithm for solving the traffic subarea division problem on a widely adopted Hadoop distributed computing platform. Specifically, we first modify the distance metric and initialization strategy of K-Means and then employ a MapReduce paradigm to redesign the optimized K-Means algorithm for parallel clustering of large-scale taxi trajectories. Moreover, we propose a boundary identifying method to connect the borders of the clustering results of each cluster. Finally, we divide the traffic subareas of Beijing, based on real-world trajectory data sets generated by 12,000 taxis over a period of one month, using the proposed approach. Experimental evaluation results indicate that, compared with K-Means, Par2PK-Means, and ParCLARA, Par3PKM achieves higher efficiency, better accuracy, and better scalability, and it can effectively divide traffic subareas with big taxi trajectory data.

1. Introduction

With the increasingly rapid economic globalization and urbanization, traffic congestion has become a critical problem and causes great concern among people and governments in metropolises [1, 2]. To alleviate traffic congestion, a large amount of money has been spent on traffic planning and management, especially on traffic subarea division, as this is crucial for ITSs. Traffic subarea division divides the whole traffic area into several different subareas, based on the similarity and correlation of traffic pattern features, in order to build a multiarea hierarchical control system. Traffic subarea division is an effective method for managing traffic control systems. This is because a complete urban traffic system is usually large and complicated, and it is difficult to analyze some traffic problems as a whole; dividing a whole urban traffic system into different traffic subareas and then studying each traffic subarea system can reduce the complexity of traffic network analysis. Therefore, traffic subarea division is also a powerful tool for analyzing complex traffic networks. More importantly, traffic subarea division can support decision-making in traffic planning and management.

Recent years have seen the coming of the big data era [3] for transportation. Massive traffic data have been growing rapidly with the 5V characteristics (i.e., Volume, Velocity, Variety, Value, and Veracity), thereby attracting great attention from all walks of life (e.g., industry [4], academia [5, 6], governments [7], and other organizations [8, 9]). In particular, taxicabs are equipped with GPS sensors for dispatching and safety in most modern cities; thus, a large number of GPS trajectories of taxicabs, with their present location, geoposition, time stamp, and occupancy information, are reported to a data center at a certain frequency every day [10, 11]. For instance, based on a Hadoop platform with ArcGIS, we employ the large-scale taxi trajectories used in this work to produce the road network of Beijing (as illustrated in Figure 1), which is essentially in agreement with the real traffic map. Figure 1 shows the density distribution of the GPS points (1,232,048 records) generated by 12,000 taxicabs in Beijing during one hour (00:14:35–01:14:34) on November 1, 2012.



Figure 1: Road network produced via GPS points of taxi trajectories with density distribution. (a) Overview of Beijing; (b) the 5th Ring Road of Beijing.

Naturally, taxi trajectory data are becoming an important mobile trajectory data source that is widely utilized by industries, academia, and governments for many practical applications; in particular, clustering GPS trajectories of taxicabs provides a new idea for traffic subarea division. K-Means is the most commonly used partitional clustering algorithm [12], but it suffers from three major shortcomings [13]: (i) its scalability is poor, (ii) K needs to be specified by the user, and (iii) the search is liable to local minima. Furthermore, K-Means has some bottlenecks in clustering the explosively growing volume of taxi trajectories, such as high memory consumption and I/O cost, low performance, and poor reliability. In particular, the execution time of K-Means is proportional to the product of the number of patterns and the number of clusters in each iteration, so the computational cost is very high, especially for large data sets. Clearly, the sequential version of existing K-Means algorithms is ill-suited to processing large-scale taxi trajectories on a single machine.

Recently, several parallel K-Means algorithms [14–22] have been proposed to meet the rapidly growing demands of clustering big data sets. Meanwhile, some methods [23–31] have been presented for traffic subarea division. These previous approaches achieve many desirable properties but also have limitations: in particular, their data-processing capacity has not been improved substantially, so they may have difficulty dividing traffic subareas with a large number of GPS trajectories of taxicabs. To meet these challenges, this paper focuses on the improvement and parallelization of K-Means to enhance the accuracy and efficiency of clustering large-scale taxi trajectories, thereby solving the traffic subarea division problem.

In this paper, we put forward a parallel K-Means optimization algorithm (Par3PKM) and implement it in a MapReduce framework on a Hadoop platform. We also divide the traffic subareas of Beijing from a large number of GPS trajectories of taxicabs through our distributed traffic subarea division (DTSAD) method. More specifically, the distance metric and the initialization strategy of K-Means are modified in Par3PKM, and the time-consuming iteration is accomplished in the MapReduce model of computation, which significantly improves the accuracy and efficiency of clustering large-scale taxi trajectories. On the other hand, to save huge amounts of communication, memory consumption, and I/O cost with MapReduce, DTSAD is performed on a distributed computing platform, Hadoop. In particular, in DTSAD, a boundary identifying method can accurately connect the borders of each cluster.

The contributions of this work are summarized as follows:

(i) An efficient parallel clustering algorithm (Par3PKM) is proposed to address the problems of the traditional K-Means algorithm in processing massive traffic data. The evaluation results demonstrate that the Par3PKM algorithm can efficiently cluster a large number of GPS trajectories of taxicabs. In particular, it can offer a practical reference for implementing parallel computing with the same type of algorithms.

(ii) A new distributed division method (DTSAD) is presented to divide traffic subareas with large-scale taxi trajectories, and a boundary identifying method is put forward to connect the borders of the clustering results. The experimental results indicate that the DTSAD method can accurately divide traffic subareas based on the proposed Par3PKM algorithm.

(iii) The aforementioned approach is applied to the traffic subarea division of Beijing using big taxi trajectory data on a Hadoop platform with MapReduce and ArcGIS. Case studies show that the proposed approach significantly improves the accuracy and efficiency of traffic subarea division; in particular, the division results are consistent with the real traffic conditions of the corresponding areas in Beijing.

The remainder of this paper is organized as follows. Section 2 reviews related work, and the motivation with our solution is given in Section 3. In Section 4, the Par3PKM algorithm is described in detail. Section 5 presents the applications of our approach and then analyzes the division results. In Section 6, we evaluate the performance of the proposed algorithm and discuss the experimental results, and we conclude the paper in Section 7.

2. Related Work

In this section, we first briefly review related work on traffic subarea division and parallel K-Means algorithms and then give an overview of the MapReduce framework used in this paper.

2.1. Traffic Subarea Division. The concept of traffic subarea was first proposed by Walinchus [23], and various methods of traffic subarea division were subsequently developed for various ITS applications, including easing traffic congestion. Wong et al. [24] presented a time-dependent TRANSYT (Traffic Network Study Tool) traffic model for area traffic control, and Robertson and Bretherton [25] introduced a SCOOT (Split Cycle Offset Optimization Technique) method to optimize networks of traffic signals in real time. Ma and Yang [26] designed an expert system of traffic subarea division to trade off several demands for managing the traffic network and reducing the complexity of traffic control, based on an integrated correlation index. Lu et al. [27, 28] built a model of partitioning traffic control subareas based on correlation degree analysis and optimized the strategy of subarea division. Guo et al. [29] provided a dynamic traffic control subarea division method according to the similarity of adjacent intersections.

Furthermore, some researchers put forward dimension-reduced processing and genetic algorithms to optimize subarea division. Li et al. [30] proposed a method to divide traffic control subareas dynamically on the basis of a Back Propagation (BP) neural network; this method divides traffic subareas by considering the traffic flow distance of intersections and the cycle. In addition, to improve the performance of large-scale urban traffic network division, Zhou et al. [31] presented a new approach to calculating the correlation degree, which determines the desire for interconnection between two adjacent intersections.

Obviously, all the solutions mentioned above have many desirable properties but may have difficulty in processing a large number of taxi trajectories. In this work, we present a distributed traffic subarea division (DTSAD) method to improve the accuracy and efficiency of division using large-scale GPS trajectories of taxicabs on a Hadoop platform with MapReduce and ArcGIS.

2.2. Parallel K-Means Algorithms. The K-Means algorithm proposed by MacQueen [32] is the most popular clustering method to partition a given data set into a user-specified number of clusters (i.e., K) [33–35], and it needs several iterations over the data sets before converging on a solution. The K-Means algorithm often converges to a local optimum. To address this problem, an incremental K-Means (IKM) algorithm [36] was put forward to empirically reach the global optimum, but it has higher complexity. Nevertheless, the traditional K-Means algorithm requires several scans over the data sets, which have to be fully loaded into memory in order to enhance the efficiency of data access, but this requirement is hard to fulfill when handling massive data [34]. To overcome this drawback, Two-Phase K-Means [37] was introduced; it can robustly find a good solution in one iteration by employing a buffering technique, but it is possibly unsuited to processing huge amounts of data.

Subsequently, Kantabutra and Couch [14] presented a parallel K-Means algorithm on a Message-Passing Interface (MPI) framework over a network of workstations (NOWs), the main idea being that all data objects of a cluster are stored in a slave node; data rearrangement in this method needs large-scale data transmission between slaves, which makes it difficult to deal with big data sets. Zhang et al. [15] implemented parallel K-Means based on a Parallel Virtual Machine (PVM) framework running on NOWs; this parallel method requires the full load of the data sets on the master node and synchronization of data at the end of each iteration. Kraj et al. [16] proposed a parallel ParaKMeans algorithm, which makes good use of multithreading on a single machine to accomplish clustering and employs sufficient statistics to measure the quality of clusters in the stop condition. Pakhira [17] introduced a distributed K-Means algorithm, which is executed on multiprocessor machines by randomly delivering the split data subsets to each processor. Kohlhoff et al. [18] developed parallel K-Means (Kps-Means) with a GPU implementation, which is efficient regardless of the dimensionality of the input data. Because the communication between different slaves takes up a large number of I/O resources and consumes huge amounts of time, these methods might have difficulty in dealing with large-scale data.

Moreover, to cluster big data effectively, some researchers implemented the parallelism of the K-Means algorithm based on a MapReduce framework. Chu et al. [19] developed a pragmatic and general framework that adopts the parallel programming method of MapReduce for a variety of learning algorithms, including K-Means. Zhao et al. [20] put forward a PKMeans algorithm using MapReduce, and Zhou et al. [21] implemented automatic classification of a large number of documents with PKMeans. Nguyen et al. [22] introduced a parallel Two-Phase K-Means algorithm (Par2PK-Means) to overcome the limitations of available parallel versions; it achieves a better speedup than the sequential version.

To the best of our knowledge, none of the aforementioned efforts has been successfully applied to solving the traffic subarea division problem with large-scale taxi trajectories. In this work, we therefore propose a parallel clustering algorithm (Par3PKM) based on MapReduce for parallel clustering of a large number of GPS trajectories of taxicabs. Differently from existing methods, in Par3PKM we modify the distance metric and initialization strategy of K-Means to improve the accuracy of clustering, and we implement parallel computing of the iteration in three phases to enhance the efficiency of computation. Furthermore, a Combiner function is added to cut down the amount of data shuffled between the Map tasks and the Reduce tasks, thereby reducing the computational complexity of the MapReduce job and saving the limited bandwidth available on a Hadoop cluster.

Figure 2: Architectures of HDFS and MapReduce.

2.3. MapReduce Framework. Hadoop Distributed File System (HDFS) [38] and Hadoop MapReduce [39, 40] are two core components of Hadoop [41–43], based on the open-source implementations of GFS [44] and MapReduce [45]; their architectures are depicted in Figure 2. For further details on Hadoop, see the Apache Hadoop website (http://hadoop.apache.org).

MapReduce is a parallel processing paradigm that allows for massive scalability across hundreds or thousands of servers in a Hadoop cluster [46], and it particularly provides an efficient computing framework for dealing with big taxi trajectory data for traffic subarea division. As a typical methodology, the processing of a MapReduce job includes a Map phase and a Reduce phase. Each phase has key-value pairs as input and output, the types of which may be selected by the programmer, who specifies the Map function and the Reduce function [40]. A simple example is illustrated in Figure 3, which shows the logical data flow of a simple MapReduce job that calculates the maximum temperature for each year by mining huge amounts of weather data.
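The logical data flow of Figure 3 can be sketched in plain Python as a single-machine simulation of the Map, shuffle/sort, and Reduce steps (this is not Hadoop code; `map_func`, `reduce_func`, and `run_job` are illustrative names, and the record format `"year,temperature"` is an assumption):

```python
from collections import defaultdict

def map_func(record):
    """Map: parse one raw line into a (year, temperature) pair."""
    year, temp = record.split(",")
    return (year, int(temp))

def reduce_func(year, temps):
    """Reduce: keep the maximum temperature observed for a year."""
    return (year, max(temps))

def run_job(records):
    # Shuffle/sort: group all mapped values by key.
    groups = defaultdict(list)
    for key, value in map(map_func, records):
        groups[key].append(value)
    # Reduce each group independently.
    return dict(reduce_func(y, ts) for y, ts in sorted(groups.items()))

print(run_job(["2013,12", "2013,46", "2014,-11", "2014,48"]))
# {'2013': 46, '2014': 48}
```

On a real cluster, the grouping step is performed by the framework's shuffle, so Map and Reduce can run on different machines without sharing memory.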

In this work, based on a MapReduce framework, the time-consuming iterations of the proposed Par3PKM algorithm are performed in three phases, with the Map function, the Combiner function, and the Reduce function; the parallel computing process of MapReduce is shown in Figure 4. Specifically, in Par3PKM, the incremental Combiner function is executed between the Map tasks and the Reduce tasks, which reduces the computational complexity of MapReduce jobs and saves the limited bandwidth available on a Hadoop cluster.

3. Motivation

In this section, we describe the motivation of this work and give a reasonable solution to the traffic subarea division problem.

Naturally, GPS-equipped taxis are an essential public transport tool in modern cities. Over the last decade, taxi trajectory data have been exploding and have become the most important mobile trajectory data source, with the advantages of broad coverage, extra precision, excellent continuity, little privacy concern, and so forth. Thus, for solving the traffic subarea division problem, substantially improving the capacity of processing large-scale GPS trajectories of taxicabs poses an urgent challenge. On the other hand, as one of the most well-known clustering techniques in practice, the traditional K-Means algorithm is a simple iterative approach but has very high time complexity and memory consumption. The time requirements for K-Means are basically linear in the number of data objects: the time required is $O(K \cdot I \cdot n \cdot m)$, where $K \le n$, $I \le n$, $K$ denotes the number of desired clusters, $I$ is the number of iterations required for convergence, $n$ is the total number of objects, and $m$ represents the number of attributes in the given data sets. In particular, the storage required is $O((K + n) \cdot m)$. Obviously, the efficiency of the serial K-Means algorithm is low when handling a large number of taxi trajectories with limited memory on a single machine. Hence, to enhance the accuracy and efficiency of clustering, another challenge is how to optimize K-Means and implement it in a MapReduce framework on a Hadoop platform.
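The cost terms above can be made concrete with a minimal serial K-Means pass. The sketch below (illustrative Python, not the paper's implementation) shows the $K \cdot n \cdot m$ distance computations performed in each of the $I$ iterations:

```python
def kmeans_iteration(points, centroids):
    """One serial K-Means pass: O(K * n * m) distance computations,
    repeated for I iterations overall, i.e. O(K * I * n * m) in total."""
    K, m = len(centroids), len(points[0])
    assignments = []
    for x in points:                       # n objects
        dists = []
        for c in centroids:                # K centroids
            # m attributes per squared-distance computation
            dists.append(sum((x[j] - c[j]) ** 2 for j in range(m)))
        assignments.append(dists.index(min(dists)))
    # Recompute each centroid as the mean of its assigned objects.
    new_centroids = []
    for k in range(K):
        members = [x for x, a in zip(points, assignments) if a == k]
        new_centroids.append([sum(col) / len(members) for col in zip(*members)]
                             if members else centroids[k])
    return assignments, new_centroids

pts = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
assign, cents = kmeans_iteration(pts, [[0.0, 0.0], [10.0, 10.0]])
print(assign)   # [0, 0, 1, 1]
print(cents)    # [[0.0, 0.5], [10.0, 10.5]]
```

The whole data set must sit in memory for every pass, which is precisely the single-machine bottleneck that motivates the MapReduce redesign.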

These challenges motivate the development of the Par3PKM algorithm with the DTSAD method, and a new solution to the above problems is illustrated in Figure 5. As shown in Figure 5, based on a Hadoop platform with MapReduce and ArcGIS, the process of the DTSAD method mainly includes the following steps. First, we preprocess correlation data extracted from large-scale taxi trajectories. Then, we cluster huge amounts of trajectory data in parallel using the proposed Par3PKM algorithm, as described in Section 4. Finally, we identify the borders of the clustering results to build each traffic subarea with our boundary identifying method, as depicted in Section 5.3.

Clearly, the key to this solution is the accuracy and efficiency of clustering large-scale taxi trajectories, which determines the overall performance of traffic subarea division. Thus, the aim of this paper is to put forward an efficient parallel clustering algorithm (Par3PKM) with a MapReduce implementation.

4. The Proposed Par3PKM Algorithm

In this section, the Parallel Three-Phase K-Means (Par3PKM) algorithm is proposed for efficiently and accurately clustering a large number of GPS trajectories of taxicabs under a MapReduce framework on Hadoop.

Figure 3: Logical data flow of a simple MapReduce job.

Figure 4: Parallel computing process of MapReduce.

Figure 5: Process and framework of the DTSAD method.

4.1. Overview and Notation. The process of the Par3PKM algorithm is depicted in Figure 6. Based on a MapReduce framework, the Par3PKM algorithm first chooses K of the objects as initial centroids, where K is the number of desired clusters. Then, each of the remaining objects is assigned to the cluster to which it is most similar, on the basis of the distance between the object and the cluster center. Finally, it computes the new center of each cluster, and this process iterates until the criterion function converges. The parallel execution of the iteration is illustrated in Figure 7, and its MapReduce implementation is described in detail in Section 4.3.

Figure 6: Process of the Par3PKM algorithm.

Figure 7: Parallel execution process of iteration.

Typically, the squared-error criterion [33] is used to measure the quality of clustering. Let $X = \{x_i \mid i = 1, \ldots, n\}$ be the set of $n$ $m$-dimensional vectors to be clustered into a set of $K$ clusters $C = \{c_k \mid k = 1, \ldots, K\}$, and let $\mu_k$ be the mean of cluster $c_k$.

The squared-error criterion between $\mu_k$ and the given objects in cluster $c_k$ is defined as

$$J(c_k) = \sum_{x_i \in c_k} \left\| x_i - \mu_k \right\|^2. \quad (1)$$

The goal of the Par3PKM algorithm is to minimize the sum of the squared errors (SSE) over all the $K$ clusters, and SSE is given by the following equation:

$$\mathrm{SSE} = J(C) = \sum_{k=1}^{K} \sum_{x_i \in c_k} \left\| x_i - \mu_k \right\|^2. \quad (2)$$
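Equations (1) and (2) translate directly into code. The sketch below (illustrative Python; `cluster_sse` and `total_sse` are assumed names) computes $J(c_k)$ for one cluster and the SSE over all $K$ clusters:

```python
def cluster_sse(points, mu):
    """J(c_k): squared error of one cluster's points against its mean, eq. (1)."""
    return sum(sum((x_j - mu_j) ** 2 for x_j, mu_j in zip(x, mu)) for x in points)

def total_sse(clusters, means):
    """SSE = J(C): sum of squared errors over all K clusters, eq. (2)."""
    return sum(cluster_sse(c, mu) for c, mu in zip(clusters, means))

clusters = [[[0.0, 0.0], [0.0, 2.0]], [[5.0, 5.0]]]
means = [[0.0, 1.0], [5.0, 5.0]]
print(total_sse(clusters, means))  # 2.0 (errors 1 + 1 from the first cluster, 0 from the second)
```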

4.2. Distance Measure and Cluster Initialization. It is well known that the K-Means algorithm is sensitive to the distance measure and the cluster initialization. To overcome these critical limitations, we optimize the distance metric and initialization strategy to improve the accuracy and efficiency of clustering in the proposed Par3PKM algorithm.

4.2.1. Distance Metric. The traditional K-Means algorithm typically employs the Euclidean metric to compute the distance between objects and cluster centers. In the proposed Par3PKM algorithm, we instead adopt two rules, using a statistical approach [47] to distance measure selection, which is more appropriate for large-scale trajectory data sets.

The two rules of distance measure selection are as follows:

(i) If $\kappa \le 846$, the square Euclidean distance (see (5)) is chosen as the distance measure of Par3PKM.

(ii) If $\kappa > 846$, the Manhattan distance (see (6)) is selected as the distance measure of Par3PKM.

Here, $\kappa$ is the kurtosis, which measures tail heaviness. It is defined as

$$\kappa = \frac{(1/n)\sum_{i=1}^{n}\left(x_i - \bar{X}\right)^4}{\sigma^4}, \quad (3)$$

where $n$ represents the sequence length, and the sample mean $\bar{X}$ and sample standard deviation $\sigma$ are given by

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \sigma = \left(\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{X}\right)^2\right)^{1/2}. \quad (4)$$

The square Euclidean distance $d_e^2$ and the Manhattan distance $d_m$ are, respectively, given by

$$d_e^2\left(x_i, \mu_k\right) = \sum_{j=1}^{m} \left| x_{ij} - \mu_{kj} \right|^2, \quad (5)$$

$$d_m\left(x_i, \mu_k\right) = \sum_{j=1}^{m} \left| x_{ij} - \mu_{kj} \right|, \quad (6)$$

where $x_i = (x_{i1}, x_{i2}, \ldots, x_{im})$ and $\mu_k = (\mu_{k1}, \mu_{k2}, \ldots, \mu_{km})$ are two $m$-dimensional data objects.
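The selection rule of Section 4.2.1 together with (3)–(6) can be sketched as follows (illustrative Python; the function names are assumptions, and the threshold value is taken verbatim from the text of the rules):

```python
def kurtosis(xs):
    """Sample kurtosis kappa, eq. (3): mean fourth moment over sigma^4,
    with the sample standard deviation of eq. (4)."""
    n = len(xs)
    mean = sum(xs) / n
    sigma = (sum((x - mean) ** 2 for x in xs) / (n - 1)) ** 0.5
    return sum((x - mean) ** 4 for x in xs) / n / sigma ** 4

def squared_euclidean(x, mu):
    """Square Euclidean distance d_e^2, eq. (5)."""
    return sum((a - b) ** 2 for a, b in zip(x, mu))

def manhattan(x, mu):
    """Manhattan distance d_m, eq. (6)."""
    return sum(abs(a - b) for a, b in zip(x, mu))

def pick_metric(sample, threshold=846):
    """Rule of Section 4.2.1: squared Euclidean when kappa <= threshold,
    Manhattan otherwise (threshold as printed in the text)."""
    return squared_euclidean if kurtosis(sample) <= threshold else manhattan

metric = pick_metric([1.0, 2.0, 2.0, 3.0, 9.0])
print(metric([1.0, 2.0], [3.0, 5.0]))  # 13.0 (kappa ~ 1.9, so squared Euclidean is chosen)
```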

The Par3PKM algorithm achieves more accurate clustering performance than the K-Means algorithm via this statistics-based distance measure method, which will be evaluated in Section 6.

4.2.2. Initialization Strategy. Different initialization strategies lead to different efficiencies. In our Par3PKM, three initialization strategies are developed to enhance the efficiency of clustering. To overcome the local minima of Par3PKM, for a given K, we run several different initial partitions and select the partition with the smallest value of the squared error. Furthermore, inspired by [48], we take the mutually farthest data objects in the high-density area as the initial centroids of Par3PKM; the method of obtaining the initial centroids from the high-density area was introduced in [49]. In addition, both pre- and postprocessing are taken into consideration in Par3PKM: for example, we remove outliers in a preprocessing step, and we eliminate small clusters and/or merge close clusters into a large cluster when postprocessing the results.

From the experimental evaluations described in Section 6, we can observe that the implemented initialization strategies shorten the execution time of Par3PKM while producing the same clustering results.

4.3. MapReduce Implementation. To improve the efficiency and scalability of clustering, we implement the Par3PKM algorithm in the MapReduce model of computation. The tasks of Par3PKM are mainly composed of the following three aspects:

(i) Compute the distance between each object and each cluster center.

(ii) Assign each object to its closest centroid.

(iii) Recompute new cluster centers for each cluster.

In this work, we accomplish the aforementioned tasks on a Hadoop platform using MapReduce. First, we select K data objects (vectors) as the initial cluster centers from the input data sets, which are stored in HDFS. Then, we update the centroid of each cluster after each iteration. As depicted in Figure 7, the parallel execution of an iteration is implemented by the Map function, the Combiner function, and the Reduce function in three phases (i.e., the Map phase, the Combine phase, and the Reduce phase), respectively.

4.3.1. Map Phase. In this phase, the Map task receives each line of the sequence file as a distinct key-value pair, which forms the input to the Map function. The Map function first computes the distance between each data object and each cluster center, then assigns each object to its closest centroid according to the shortest distance, and finally outputs the intermediate data to the Combiner function. The Map function is formally depicted in Algorithm 1.

4.3.2. Combine Phase. In this phase, the Combiner function first extracts all the data objects from value1, which is the output of the Map function, and merges the data objects belonging to the same cluster center. Next, it calculates the sum of the values of the data objects assigned to the same cluster and records the number of samples in that cluster, so as to compute the mean value of the data objects. Finally, it outputs the local results of each cluster to the Reduce function. The Combiner function is formally described in Algorithm 2.

4.3.3. Reduce Phase. In this phase, the Reduce function extracts all the data objects from value2, which is the output of the Combiner function, and aggregates the local results of all the clusters. Then, it computes the new cluster center of each cluster. After that, it judges whether the criterion function converges. Finally, it outputs the final results if convergence is reached and executes the next iteration otherwise. The Reduce function is formally illustrated in Algorithm 3.
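The three phases can be simulated in memory as follows. This is a toy sketch of one Par3PKM iteration; the function and variable names are ours, and Hadoop details (key-value serialization, HDFS I/O, task scheduling) are omitted:

```python
def map_phase(points, centroids):
    """Map: emit (index of closest centroid, point) for every point."""
    mapped = []
    for p in points:
        idx = min(range(len(centroids)),
                  key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
        mapped.append((idx, p))
    return mapped

def combine_phase(mapped):
    """Combine: per cluster, a partial coordinate sum and sample count."""
    sums, counts = {}, {}
    for idx, p in mapped:
        if idx not in sums:
            sums[idx], counts[idx] = list(p), 1
        else:
            sums[idx] = [a + b for a, b in zip(sums[idx], p)]
            counts[idx] += 1
    return {idx: (sums[idx], counts[idx]) for idx in sums}

def reduce_phase(partials_list):
    """Reduce: aggregate the combiners' partial sums, emit new centers."""
    sums, counts = {}, {}
    for partials in partials_list:
        for idx, (s, n) in partials.items():
            if idx not in sums:
                sums[idx], counts[idx] = list(s), n
            else:
                sums[idx] = [a + b for a, b in zip(sums[idx], s)]
                counts[idx] += n
    return {idx: [v / counts[idx] for v in sums[idx]] for idx in sums}

# Two "map tasks" over a split input, one centroid update:
task1 = map_phase([(0, 0), (1, 1)], [(0, 0), (10, 10)])
task2 = map_phase([(9, 9), (10, 10)], [(0, 0), (10, 10)])
new_centers = reduce_phase([combine_phase(task1), combine_phase(task2)])
print(new_centers)  # {0: [0.5, 0.5], 1: [9.5, 9.5]}
```

The combiner is what keeps shuffle traffic small: only per-cluster sums and counts cross the network, never individual points.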

4.4. Complexity Analysis. MapReduce is a programming model for large-scale data processing, and MapReduce programs are inherently parallel. Thus, the Par3PKM algorithm with its MapReduce implementation distributes large numbers of computational tasks across different nodes. Under this parallel processing paradigm, the time complexity of Par3PKM is O(K * I * n * m)/(p * q), where p is the number of nodes, q is the number of Map tasks in one node, and K, I, n, and m are explained in Section 3. Moreover, the space complexity of Par3PKM is O((K + n) * m)/p. Only the data objects and the centroids are stored in each slave node; thus, the space requirements of Par3PKM are modest.

In comparison with the traditional K-Means algorithm, the Par3PKM algorithm improves the efficiency of clustering at the following rate:

\[
O_{\mathrm{imp}} = \left( 1 - \frac{1}{p \cdot q} \right) \times 100\% \tag{7}
\]
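As a numeric sanity check of (7): with p = 8 nodes (the cluster size used in Section 6.1) and a hypothetical q = 2 Map tasks per node, the improvement rate works out as follows (our own arithmetic, for illustration only):

```python
def improvement_rate(p, q):
    """Efficiency improvement of Eq. (7): (1 - 1/(p*q)) * 100%."""
    return (1 - 1 / (p * q)) * 100

# p = 8 nodes, q = 2 Map tasks per node (hypothetical):
print(improvement_rate(8, 2))  # 93.75
```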

Based on the above analyses, we can conclude that the Par3PKM algorithm is relatively scalable and efficient in clustering large-scale data sets, with the desired computational complexity.

5. Case Study

In this section, we apply the proposed approach to divide the traffic subareas of Beijing using large-scale taxi trajectories and then analyze the results.

5.1. Data Sets and Framework. In this work, we divide the traffic subareas of Beijing based on real-world trajectory data sets (http://www.datatang.com), which contain a large number of GPS trajectories recorded by 12,000 taxicabs during a period of 30 days in November 2012. The total distance of the data sets is more than 50 million kilometers, and the total size is 50 GB. In particular, the total number of GPS points reaches 969 million.

Additionally, we perform the case study on a Hadoop cluster with MapReduce (as described in Section 6.1) and ArcGIS, using the road network of Beijing, which consists of 106,579 road nodes and 141,380 road segments.


Input:
  key: the offset; value: the sample; centroids: the global variable
Output: ⟨key1, value1⟩
  key1: the index of the closest centroid; value1: the information of the sample

(1) Construct a global variable centroids including the information of the closest centroid point
(2) Construct the sample examples to extract the data objects from value
(3) min_Dis = Double.MAX_VALUE
(4) index = −1
(5) for i = 0 to centroids.length do
(6)   distance = DistanceFunction(examples, centroids[i])
(7)   if distance < min_Dis then
(8)     min_Dis = distance
(9)     index = i
(10)  end if
(11) end for
(12) key1 = index
(13) Construct value1 as a string consisting of the values of different dimensions
(14) return ⟨key1, value1⟩ pairs

Algorithm 1: Map(key, value).

Input:
  key1: the index of the cluster; medi: the list of the samples assigned to the same cluster
Output: ⟨key2, value2⟩
  key2: the index of the cluster; value2: the sum of the values of the samples belonging to the same cluster and the number of samples

(1) Construct a counter num_s to record the number of samples in the same cluster
(2) Construct an array sum_v to record the sum of the values of the different dimensions of the samples belonging to the same cluster (i.e., the samples in the list medi)
(3) Construct the sample examples to extract the data objects from medi.next() and the dimensions to obtain the dimension of the original data object
(4) num_s = 0
(5) while (medi.hasNext()) do
(6)   CurrentPoint = medi.next()
(7)   num_s++
(8)   for i = 0 to dimensions do
(9)     sum_v[i] += CurrentPoint.point[i]
(10)    Calculate the sum of the values of each dimension of examples
(11)  end for
(12)  for i = 0 to dimensions do
(13)    mean[i] = sum_v[i] / num_s
(14)    Compute the mean value of the samples of each cluster
(15)  end for
(16) end while
(17) key2 = key1
(18) Construct value2 as a string containing the sum of the values of each dimension sum_v[i] and the number of samples num_s
(19) return ⟨key2, value2⟩ pairs

Algorithm 2: Combiner(key1, medi).


Input:
  key2: the index of the cluster; medi: the list of the local sums from different clusters
Output: ⟨key3, value3⟩
  key3: the index of the cluster; value3: the new cluster center

(1) Construct a counter Num to record the total number of samples belonging to the same cluster
(2) Construct an array sum_v to record the sum of the values of the different dimensions of the samples in the same cluster (i.e., the samples in the list medi)
(3) Construct the sample examples to extract the data objects from medi.next() and the dimensions to obtain the dimension of the original data object
(4) Num = 0
(5) while (medi.hasNext()) do
(6)   CurrentPoint = medi.next()
(7)   Num += num_s
(8)   for i = 0 to dimensions do
(9)     sum_v[i] += CurrentPoint.point[i]
(10)  end for
(11)  for i = 0 to dimensions do
(12)    mean[i] = sum_v[i] / Num
(13)    Obtain the new cluster center
(14)  end for
(15) end while
(16) key3 = key2
(17) Construct value3 as a string composed of the new cluster center
(18) return ⟨key3, value3⟩ pairs

Algorithm 3: Reduce(key2, medi).

Figure 8: Clustering results of large-scale taxi trajectories.

5.2. Parallel Clustering. After data preprocessing with the DTSAD method, we extract the relevant attributes (e.g., longitude and latitude) of the GPS trajectory records where passengers pick up or drop off taxis from the aforementioned data sets. Then, on a Hadoop cluster with MapReduce, we cluster the extracted trajectory data sets through the Par3PKM algorithm (as depicted in Section 4) with K = 100, which is the number of desired clusters. Finally, based on the ArcGIS platform with the road network of Beijing, we plot the results in Figure 8.

As illustrated in Figure 8, large-scale taxi trajectories are clustered in different areas, and each area with a different color represents a cluster. Each cluster (e.g., Area A) has obvious characteristics of traffic condition, such as a flow of people and automobiles that is high in these areas, in comparison with the real traffic map and traffic conditions of Beijing.

Figure 9: Process of the boundary identifying method: (a) division of the coordinate system, (b) selection of border points, and (c) connection of border points.

5.3. Boundary Identifying. On the ArcGIS platform, the borders of each cluster are difficult to identify directly. However, to accurately form a traffic subarea, we must connect the borders of each cluster, which we do via our boundary identifying method, described in Figure 9. As illustrated in Figure 9, the boundary identifying method is mainly composed of the following three steps.

Step 1. As shown in Figure 9(a), we build a coordinate system that is equally divided into n parts, taking (0, 0) as the origin of coordinates.

Step 2. We match each cluster center to the origin of coordinates and then map the other points of the same cluster into the coordinate system. Finally, the farthest point within each part is selected (e.g., P in Figure 9(b)).

Step 3. As depicted in Figure 9(c), we connect the selected points of each part and thereby obtain a subarea.
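The three steps above can be sketched as follows, under our own reading: the n equal parts are angular sectors around the cluster center, and the farthest cluster member in each sector becomes a border point. All names are illustrative:

```python
import math

def subarea_boundary(center, members, n=36):
    """Steps 1-3: sector the plane around the cluster center, keep the
    farthest member per sector, return them in angle order."""
    farthest = {}                      # sector index -> (radius, point)
    for x, y in members:
        dx, dy = x - center[0], y - center[1]
        ang = math.atan2(dy, dx)
        if ang < 0:                    # map (-pi, pi] onto [0, 2*pi)
            ang += 2 * math.pi
        sector = min(int(ang / (2 * math.pi / n)), n - 1)
        r = math.hypot(dx, dy)
        if sector not in farthest or r > farthest[sector][0]:
            farthest[sector] = (r, (x, y))
    # Border points in sector (angle) order; connecting them in this
    # order closes the subarea polygon (Step 3).
    return [p for _, (_, p) in sorted(farthest.items())]

members = [(1, 1), (2, 2), (-1, 1), (-1, -1), (1, -1), (3, -3)]
print(subarea_boundary((0, 0), members, n=4))
# [(2, 2), (-1, 1), (-1, -1), (3, -3)]
```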

Using the presented boundary identifying method with the clustering results of Par3PKM, we plot the division results of the traffic subareas in Figure 10. As described in Figure 10, Area A is a typical traffic subarea, shown in the lower right corner of the graph, which includes Tsinghua University, Peking University, Renmin University of China, Beihang University, and Zhongguancun. Area B is composed of the Beijing Workers' Gymnasium, Blue Zoo Beijing, Children Outdoor Paopao Paradise, and so forth. Area C consists of Beijing North Railway Station, Beijing Exhibition Center, Capital Indoor Stadium, and so forth. Area D contains the Olympic Sports Center Stadium, Beijing Olympic Park, the National Indoor Stadium, the Beijing National Aquatics Center, the Beijing International Convention Center, and so forth.

5.4. Analysis of Results. According to the division results, we can observe that some areas with similar traffic conditions are divided into the same traffic subarea, such as Tsinghua University and Peking University, and the Olympic Sports Center Stadium and the National Indoor Stadium. In contrast, Blue Zoo Beijing and the Beijing Exhibition Center, as well as Beijing North Railway Station and the Beijing International Convention Center, are classified into different traffic subareas. That is because the different regions of a single traffic subarea have great similarities and correlations in traffic condition, business pattern, and other aspects.

Based on the above analysis, we conclude that the proposed Par3PKM algorithm can efficiently cluster big trajectory data on a Hadoop cluster using MapReduce. Moreover, our boundary identifying method can accurately connect the borders of the clustering results of each cluster. In particular, the division results are consistent with the real traffic conditions of the corresponding areas in Beijing. Overall, the results demonstrate that Par3PKM combined with DTSAD is a promising alternative for traffic subarea division with large-scale taxi trajectories and can thus reduce the complexity of traffic planning, management, and analysis. More importantly, it can provide helpful decision-making support for building ITSs.

6. Evaluation and Discussion

In this section, we evaluate the accuracy, efficiency, speedup, scale-up, and reliability of the Par3PKM algorithm via extensive experiments on real and synthetic data and then discuss the performance results.

6.1. Evaluation Setup. The experimental platform is based on a Hadoop cluster composed of one master machine and eight slave machines, each with an Intel Xeon E7-4820 2.00 GHz CPU (4-core) and 8.00 GB RAM. All the experiments are performed on Ubuntu 12.04 OS with Hadoop 1.0.4 and JDK 1.6.0.


Figure 10: Division results of traffic subarea.

Table 1: Data sets of the experimental evaluations.

Name                  Number of instances    Number of attributes
Iris                  150                    4
Haberman's Survival   306                    3
Ecoli                 336                    8
Hayes-Roth            160                    5
Lenses                24                     4
Wine                  178                    13

In addition to a real taxi trajectory data set (as described in Section 5.1), we use six synthetic data sets (as shown in Table 1) selected from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets.html) to evaluate the performance of the Par3PKM algorithm in comparison with K-Means, Par2PK-Means, and ParCLARA, which is the best-known K-medoids algorithm, CLARA (Clustering Large Applications) [50], with a MapReduce implementation. Meanwhile, each data set is processed into 80 MB, 160 MB, 320 MB, 640 MB, 1280 MB, and 2560 MB versions so as to further verify the efficiency of the proposed algorithm. Also, we process the seven data sets into 160 MB, 320 MB, 480 MB, 640 MB, 800 MB, 960 MB, 1120 MB, and 1280 MB versions for validating the scale-up of our algorithm.

6.2. Evaluation on Efficiency. We perform efficiency experiments in which we execute Par3PKM, Par2PK-Means, and ParCLARA in the parallel environment with eight nodes, and K-Means in the single-machine environment, using seven data sets of different sizes (varying from 80 MB to 2560 MB), in order to demonstrate whether the Par3PKM algorithm can process larger data sets with higher efficiency. The experimental results are shown in Table 2 and Figure 11, respectively.

As depicted in Figure 11, the K-Means algorithm cannot process data sets over 1280 MB in the single-machine environment on account of memory overflow, and thus the graph does not present the corresponding execution time of K-Means in this experiment. However, the Par3PKM algorithm can effectively handle data sets of more than 1280 MB, and even larger data, in the parallel environment (i.e., in a "Big Data" environment). In particular, the Par3PKM algorithm has higher efficiency than the ParCLARA algorithm and the Par2PK-Means algorithm, owing to the improvement and parallelization of the K-Means algorithm, such as the addition of the Combiner function in the Combine phase. To reduce the computational cost of a MapReduce job and save the limited bandwidth available on a Hadoop cluster, the Combiner function of the Par3PKM algorithm is employed to cut down the amount of data shuffled between the Map tasks and the Reduce tasks; it is specified to run on the Map output, and its output forms the input to the Reduce function.

At the same time, we can find that the execution time of the K-Means algorithm is shorter than that of the Par3PKM algorithm in clustering small-scale data sets. The reason is that the communication and interaction of each node consume a certain amount of time in the parallel environment (e.g., the start of Job and Task processes, and the communication between the NameNode and DataNodes), thereby making the execution time much longer than the actual computation time of the Par3PKM algorithm. More importantly, we can also observe that the efficiency of the Par3PKM algorithm improves multiplicatively, and its superiority becomes more marked, as the size of the data sets gradually increases.


Table 2: Execution time comparison on seven data sets.

Data sets             Size (MB)    Execution time (s)
                                   K-Means    ParCLARA    Par2PK-Means    Par3PKM
Taxi Trajectory       80           75         662         439             312
                      160          103        756         486             371
                      320          195        893         579             406
                      640          1125       1173        838             614
                      1280         2373       1679        1348            1026
                      2560         —          2721        2312            1879
Iris                  80           32         612         301             213
                      160          54         693         380             279
                      320          116        812         440             306
                      640          685        1045        630             403
                      1280         1800       1248        1005            768
                      2560         —          2463        2013            1420
Haberman's Survival   80           56         576         311             296
                      160          60         675         400             324
                      320          130        823         470             378
                      640          720        987         670             426
                      1280         2010       1321        1200            873
                      2560         —          2449        2200            1719
Ecoli                 80           38         628         330             283
                      160          66         712         400             324
                      320          130        835         460             375
                      640          756        1104        700             482
                      1280         1912       1636        1234            763
                      2560         —          2479        2312            1416
Hayes-Roth            80           41         568         310             278
                      160          57         643         395             347
                      320          125        726         460             387
                      640          715        973         660             438
                      1280         1980       1479        1211            736
                      2560         —          2423        2120            1567
Lenses                80           32         624         309             297
                      160          59         701         389             327
                      320          130        924         452             376
                      640          700        1072        545             432
                      1280         1895       1378        1085            814
                      2560         —          2379        2089            1473
Wine                  80           50         635         350             317
                      160          78         705         420             356
                      320          130        835         470             402
                      640          730        1006        610             426
                      1280         2100       1346        1245            843
                      2560         —          2463        2240            1645

6.3. Evaluation on Accuracy. To evaluate accuracy, we cluster the different data sets via Par3PKM, Par2PK-Means, and ParCLARA on a Hadoop cluster with eight nodes, and via K-Means in the single-machine environment, respectively. We then plot the results in Figure 12(a).

The quality of the algorithm is evaluated via the following error rate (ER) [51] equation:

\[
\mathrm{ER} = \frac{O_m}{O_t} \times 100\% \tag{8}
\]



Figure 11: Efficiency comparison on different data sets: (a) Taxi Trajectory, (b) Iris, (c) Haberman's Survival, (d) Ecoli, (e) Hayes-Roth, (f) Lenses, and (g) Wine.


Figure 12: Accuracy and reliability: (a) accuracy comparison of Par3PKM, Par2PK-Means, and ParCLARA on different data sets and (b) reliability of Par3PKM.

where $O_m$ is the number of misclassified objects and $O_t$ is the total number of objects. The lower the ER, the better the clustering.
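Equation (8) reduces to a one-line ratio; the sample counts below are hypothetical, not measured values from the experiments:

```python
def error_rate(o_m, o_t):
    """Error rate of Eq. (8): misclassified objects over total, in %."""
    return o_m / o_t * 100

# 30 misclassified out of 120 objects (hypothetical counts):
print(error_rate(30, 120))  # 25.0
```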

As illustrated in Figure 12(a), in comparison with the other algorithms, the Par3PKM algorithm produces more accurate clustering results in most cases. The results indicate that the Par3PKM algorithm is valid and feasible.

6.4. Evaluation on Speedup. In order to evaluate the speedup of the Par3PKM algorithm, we keep the seven data sets constant (1280 MB) and increase the number of nodes (ranging from 1 node to 8 nodes) on a Hadoop cluster, and we then plot the results in Figure 13(a). Moreover, we utilize the Iris data set (1280 MB) to further verify the speedup of Par3PKM in comparison with Par2PK-Means and ParCLARA, and the results are illustrated in Figure 13(b).

The speedup metric [52, 53] is defined as

\[
\mathrm{Speedup} = \frac{T_s}{T_p} \tag{9}
\]

where $T_s$ represents the execution time of an algorithm on one node (i.e., the sequential execution time) for clustering objects using the given data sets, and $T_p$ denotes the execution time of an algorithm for solving the same problem using the same data sets on a Hadoop cluster with $p$ nodes (i.e., the parallel execution time).

Figure 13: Speedup: (a) speedup comparison of Par3PKM on different data sets and (b) speedup comparison of Par3PKM, Par2PK-Means, and ParCLARA.

As depicted in Figure 13(a), the speedup of the Par3PKM algorithm increases relatively linearly with an increasing number of nodes. It is known that linear speedup is difficult to achieve because of the communication cost and the skew of the slaves. Furthermore, Figure 13(b) shows that Par3PKM has better speedup than Par2PK-Means and ParCLARA. The results demonstrate that the parallel algorithm Par3PKM has very good speedup performance, which is almost the same for data sets with very different sizes.

6.5. Evaluation on Scale-Up. To evaluate how well the Par3PKM algorithm processes larger data sets when more nodes are available, we perform scale-up experiments in which we increase the size of the data sets (varying from 160 MB to 1280 MB) in direct proportion to the number of nodes (ranging from 1 node to 8 nodes), and we then plot the results in Figure 14(a). Furthermore, with the Iris data sets (varying from 160 MB to 1280 MB), the scale-up comparison of Par3PKM, Par2PK-Means, and ParCLARA is depicted in Figure 14(b).

The scale-up metric [52, 53] is given by

\[
\text{Scale-up} = \frac{T_s}{T_p} \tag{10}
\]

where $T_s$ is the execution time of an algorithm for processing the given data sets on one node, and $T_p$ is the execution time of an algorithm for handling $p$-times larger data sets on $p$-times larger nodes.
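Both metrics (9) and (10) reduce to simple ratios of measured execution times. A minimal sketch, with hypothetical timings rather than values taken from Table 2:

```python
def speedup(t_serial, t_parallel):
    """Eq. (9): T_s / T_p for a fixed data set run on p nodes."""
    return t_serial / t_parallel

def scale_up(t_one_node, t_p_nodes):
    """Eq. (10): T_s / T_p, where data size and node count grow p-fold
    together; values near 1 indicate near-perfect scalability."""
    return t_one_node / t_p_nodes

print(speedup(1800.0, 300.0))   # 6.0
print(scale_up(100.0, 125.0))   # 0.8
```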

As illustrated in Figure 14(a), the scale-up values of Par3PKM are in the vicinity of 1, or even less, with the proportional growth of both the number of nodes and the size of the data sets. Moreover, Figure 14(b) shows that Par3PKM has better scale-up than Par2PK-Means and ParCLARA. The results indicate that the Par3PKM algorithm has excellent scale-up and adaptability on large-scale data sets on a Hadoop cluster with MapReduce.

6.6. Evaluation on Reliability. To evaluate the reliability of the Par3PKM algorithm, we shut down several nodes (ranging from 1 node to 7 nodes) to demonstrate whether Par3PKM can execute normally and achieve the same clustering results from the Iris data set at 1280 MB, and we then plot the results in Figure 12(b).

As illustrated in Figure 12(b), although the execution time of the Par3PKM algorithm increases gradually with the growth of the number of faulty nodes, the Par3PKM algorithm still executes normally and produces the same results. The results show that the Par3PKM algorithm has good reliability in a "Big Data" environment, due to the high fault tolerance of the MapReduce framework on a Hadoop platform: when a node cannot execute its tasks on a Hadoop cluster, the JobTracker automatically assigns the tasks of the faulty node(s) to other spare nodes. Conversely, when the single machine running the serial K-Means algorithm is faulty, the algorithm cannot execute normally, and the entire computational task fails.

In summary, extensive experiments are conducted on real and synthetic data, and the performance results demonstrate that the proposed Par3PKM algorithm is much more efficient and accurate, with better speedup, scale-up, and reliability.



Figure 14: Scale-up: (a) scale-up comparison of Par3PKM on different data sets and (b) scale-up comparison of Par3PKM, Par2PK-Means, and ParCLARA.

7. Conclusions

In this paper, we have proposed an efficient MapReduce-based parallel clustering algorithm, named Par3PKM, to solve the traffic subarea division problem with large-scale taxi trajectories. In Par3PKM, the distance metric and initialization strategy of K-Means are optimized in order to enhance the accuracy of clustering. Then, to improve the efficiency and scalability of Par3PKM, the optimized K-Means algorithm is implemented in a MapReduce framework on Hadoop. The optimization and parallelization of Par3PKM save memory consumption and reduce the computational cost of big calculations, thereby significantly improving the accuracy, efficiency, scalability, and reliability of traffic subarea division. Our performance evaluation indicates that the proposed algorithm can efficiently cluster a large number of GPS trajectories of taxicabs and, in particular, achieves more accurate results than K-Means, Par2PK-Means, and ParCLARA, with favorably superior performance. Furthermore, based on Par3PKM, we have presented a distributed traffic subarea division method, named DTSAD, which is performed on a Hadoop distributed computing platform with the MapReduce parallel processing paradigm. In DTSAD, the boundary identifying method can effectively connect the borders of the clustering results. Most importantly, we have divided the traffic subareas of Beijing using big real-world taxi trajectory data sets through the presented method, and our case study demonstrates that our approach can accurately and efficiently divide traffic subareas.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Authors' Contribution

Dawen Xia and Binfeng Wang contributed equally to this work.

Acknowledgments

The authors would like to thank the academic editor and the anonymous reviewers for their valuable comments and suggestions. This work was partially supported by the National Natural Science Foundation of China (Grant no. 61402380), the Scientific Project of the State Ethnic Affairs Commission of the People's Republic of China (Grant no. 14GZZ012), the Science and Technology Foundation of Guizhou (Grant no. LH20147386), and the Fundamental Research Funds for the Central Universities (Grants nos. XDJK2015B030 and XDJK2015D029).

References

[1] Y. Qi and S. Ishak, "Stochastic approach for short-term freeway traffic prediction during peak periods," IEEE Transactions on Intelligent Transportation Systems, vol. 14, no. 2, pp. 660–672, 2013.

[2] A. de Palma and R. Lindsey, "Traffic congestion pricing methodologies and technologies," Transportation Research Part C: Emerging Technologies, vol. 19, no. 6, pp. 1377–1399, 2011.

[3] V. Marx, "The big challenges of big data," Nature, vol. 498, no. 7453, pp. 255–260, 2013.

[4] D. Agrawal, P. Bernstein, E. Bertino et al., "Challenges and opportunities with big data: a community white paper developed by leading researchers across the United States," White Paper, 2012.

[5] "Special online collection: dealing with data," Science, vol. 331, no. 6018, pp. 639–806, 2011.

[6] "Big data: science in the petabyte era," Nature, vol. 455, no. 7209, pp. 1–136, 2008.

[7] R. R. Weiss and L. Zgorski, Obama Administration Unveils 'Big Data' Initiative: Announces $200 Million in New R&D Investments, Office of Science and Technology Policy, Executive Office of the President, 2012.

[8] J. Manyika, M. Chui, B. Brown et al., "Big data: the next frontier for innovation, competition, and productivity," Tech. Rep., McKinsey Global Institute, 2011.

[9] Y. Genovese and S. Prentice, "Pattern-based strategy: getting value from big data," Gartner Special Report, 2011.

[10] N. J. Yuan, Y. Zheng, L. Zhang, and X. Xie, "T-finder: a recommender system for finding passengers and vacant taxis," IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 10, pp. 2390–2403, 2013.

[11] J. Yuan, Y. Zheng, C. Zhang et al., "T-drive: driving directions based on taxi trajectories," in Proceedings of the 18th International Conference on Advances in Geographic Information Systems (ACM SIGSPATIAL GIS '10), pp. 99–108, ACM, San Jose, Calif, USA, November 2010.

[12] J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, Waltham, Mass, USA, 3rd edition, 2011.

[13] D. Pelleg and A. Moore, "X-means: extending K-means with efficient estimation of the number of clusters," in Proceedings of the 17th International Conference on Machine Learning (ICML '00), pp. 727–734, Stanford, Calif, USA, June 2000.

[14] S. Kantabutra and A. L. Couch, "Parallel K-means clustering algorithm on NOWs," NECTEC Technical Journal, vol. 1, no. 6, pp. 243–247, 2000.

[15] Y. Zhang, Z. Xiong, J. Mao, and L. Ou, "The study of parallel K-means algorithm," in Proceedings of the 6th World Congress on Intelligent Control and Automation (WCICA '06), pp. 5868–5871, IEEE, Dalian, China, June 2006.

[16] P. Kraj, A. Sharma, N. Garge, R. Podolsky, and R. A. McIndoe, "ParaKMeans: implementation of a parallelized K-means algorithm suitable for general laboratory use," BMC Bioinformatics, vol. 9, article 200, 13 pages, 2008.

[17] M. K. Pakhira, "Clustering large databases in distributed environment," in Proceedings of the IEEE International Advance Computing Conference (IACC '09), pp. 351–358, Patiala, India, March 2009.

[18] K. J. Kohlhoff, V. S. Pande, and R. B. Altman, "K-means for parallel architectures using all-prefix-sum sorting and updating steps," IEEE Transactions on Parallel and Distributed Systems, vol. 24, no. 8, pp. 1602–1612, 2013.

[19] C.-T. Chu, S. K. Kim, Y.-A. Lin et al., "Map-reduce for machine learning on multicore," in Proceedings of the 20th Annual Conference on Neural Information Processing Systems (NIPS '06), pp. 281–288, Vancouver, Canada, 2006.

[20] W. Zhao, H. Ma, and Q. He, "Parallel K-means clustering based on MapReduce," in Proceedings of the 1st International Conference on Cloud Computing (CloudCom '09), pp. 674–679, Beijing, China, December 2009.

[21] P. Zhou, J. Lei, and W. Ye, "Large-scale data sets clustering based on MapReduce and Hadoop," Journal of Computational Information Systems, vol. 7, no. 16, pp. 5956–5963, 2011.

[22] C. D. Nguyen, D. T. Nguyen, and V.-H. Pham, "Parallel two-phase K-means," in Proceedings of the 13th International Conference on Computational Science and Its Applications (ICCSA '13), pp. 24–27, Ho Chi Minh City, Vietnam, June 2013.

[23] R. J. Walinchus, "Real-time network decomposition and subnetwork interfacing," Tech. Rep. HS-011 999, Highway Research Record, 1971.

[24] S. C. Wong, W. T. Wong, C. M. Leung, and C. O. Tong, "Group-based optimization of a time-dependent TRANSYT traffic model for area traffic control," Transportation Research Part B: Methodological, vol. 36, no. 4, pp. 291–312, 2002.

[25] D. I. Robertson and R. D. Bretherton, "Optimizing networks of traffic signals in real time: the SCOOT method," IEEE Transactions on Vehicular Technology, vol. 40, no. 1, pp. 11–15, 1991.

[26] Y.-Y. Ma and X.-G. Yang, "Traffic sub-area division expert system for urban traffic control," in Proceedings of the International Conference on Intelligent Computation Technology and Automation (ICICTA '08), pp. 589–593, Hunan, China, October 2008.

[27] K. Lu, J.-M. Xu, S.-J. Zheng, and S.-M. Wang, "Research on fast dynamic division method of coordinated control subarea," Acta Automatica Sinica, vol. 38, no. 2, pp. 279–287, 2012.

[28] K. Lu, J.-M. Xu, and S.-J. Zheng, "Correlation degree analysis of neighboring intersections and its application," Journal of South China University of Technology, vol. 37, no. 11, pp. 37–42, 2009.

[29] H. Guo, J. Cheng, Q. Peng, C. Zhu, and Y. Mu, "Dynamic division of traffic control sub-area methods based on the similarity of adjacent intersections," in Proceedings of the IEEE 17th International Conference on Intelligent Transportation Systems (ITSC '14), pp. 2208–2213, Qingdao, China, October 2014.

[30] C. Li, Y. Xie, H. Zhang, and X.-L. Yan, "Dynamic division about traffic control sub-area based on back propagation neural network," in Proceedings of the 2nd International Conference on Intelligent Human-Machine Systems and Cybernetics (IHMSC '10), pp. 22–25, Nanjing, China, August 2010.

[31] Z. Zhou, S. Lin, and Y. Xi, "A fast network partition method for large-scale urban traffic networks," Journal of Control Theory and Applications, vol. 11, no. 3, pp. 359–366, 2013.

[32] J. MacQueen, "Some methods for classification and analysis of multivariate observations," in Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, pp. 281–297, University of California Press, Berkeley, Calif, USA, 1967.

[33] A. K. Jain, "Data clustering: 50 years beyond K-means," Pattern Recognition Letters, vol. 31, no. 8, pp. 651–666, 2010.

[34] X. Wu, V. Kumar, J. R. Quinlan et al., "Top 10 algorithms in data mining," Knowledge and Information Systems, vol. 14, no. 1, pp. 1–37, 2008.

[35] W. Zou, Y. Zhu, H. Chen, and X. Sui, "A clustering approach using cooperative artificial bee colony algorithm," Discrete Dynamics in Nature and Society, vol. 2010, Article ID 459796, 16 pages, 2010.

[36] D. T. Pham, S. S. Dimov, and C. D. Nguyen, "An incremental K-means algorithm," Proceedings of the Institution of Mechanical Engineers, Part C: Journal of Mechanical Engineering Science, vol. 218, no. 7, pp. 783–794, 2004.

[37] D. T. Pham, S. S. Dimov, and C. D. Nguyen, "A two-phase K-means algorithm for large datasets," Journal of Mechanical Engineering Science, vol. 218, no. 10, pp. 1269–1273, 2004.

[38] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, "The Hadoop distributed file system," in Proceedings of the 26th Symposium on Mass Storage Systems and Technologies (MSST '10), pp. 1–10, IEEE, Incline Village, Nev, USA, May 2010.

[39] A. Ene, S. Im, and B. Moseley, "Fast clustering using MapReduce," in Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '11), pp. 681–689, San Diego, Calif, USA, August 2011.

[40] T. White, Hadoop: The Definitive Guide, O'Reilly Media, Sebastopol, Calif, USA, 3rd edition, 2012.

[41] D. Xia, Z. Rong, Y. Zhou, Y. Li, Y. Shen, and Z. Zhang, "A novel parallel algorithm for frequent itemsets mining in massive small files datasets," ICIC Express Letters, Part B: Applications, vol. 5, no. 2, pp. 459–466, 2014.

[42] D. Xia, Z. Rong, Y. Zhou, B. Wang, Y. Li, and Z. Zhang, "Discovery and analysis of usage data based on Hadoop for personalized information access," in Proceedings of the IEEE 16th International Conference on Computational Science and Engineering—Big Data Science and Engineering (CSE-BDSE '13), pp. 917–924, IEEE, Sydney, Australia, December 2013.

[43] D Xia B Wang Z Rong Y Li and Z Zhang ldquoEffectivemethods and strategies for massive small files processing basedon Hadooprdquo ICIC Express Letters vol 8 no 7 pp 1935ndash19412014

[44] S Ghemawat H Gobioff and S-T Leung ldquoThe google file sys-temrdquo in Proceedings of the 19th ACM Symposium on OperatingSystems Principles (SOSP rsquo03) pp 29ndash43 Bolton Landing NYUSA October 2003

[45] J Dean and SGhemawat ldquoMapReduce simplified data process-ing on large clustersrdquo Communications of the ACM vol 51 no1 pp 107ndash113 2008

[46] P Zikopoulos C Eaton D deRoos T Deutsch and G LapisUnderstanding Big Data Analytics for Enterprise Class Hadoopand Streaming Data McGraw-Hill New York NY USA 2011

[47] W K D Pun and A B M S Ali ldquoUnique distance measureapproach for K-means (UDMA-Km) clustering algorithmrdquo inProceedings of the IEEE Region 10 Conference (TENCON rsquo07)pp 1ndash4 IEEE Taipei Taiwan November 2007

[48] A M Fahim A M Salem F A Torkey and M A RamadanldquoAn efficient enhanced K-means clustering algorithmrdquo Journalof Zhejiang University SCIENCE A vol 7 no 10 pp 1626ndash16332006

[49] M Zhu Data Mining University of Science and Technology ofChina Press 2002

[50] L Kaufman and P J Rousseeuw Finding Groups in Data AnIntroduction to Cluster Analysis John Wiley amp Sons 1990

[51] A Krizhevsky I Sutskever andG EHinton ldquoImagenet classifi-cation with deep convolutional neural networksrdquo in Proceedingsof the 26th Annual Conference on Neural Information ProcessingSystems (NIPS rsquo12) pp 1097ndash1105 Lake Tahoe Nev USADecember 2012

[52] S Englert J Gray T Kocher and P Shah ldquoA benchmark ofNonStop SQL release 2 demonstrating near-linear speedup andscaleup on large databasesrdquo ACM SIGMETRICS PerformanceEvaluation Review vol 18 no 1 pp 245ndash246 1990

[53] X Xu J Jager and H-P Kriegel ldquoA fast parallel clustering algo-rithm for large spatial databasesrdquo Data Mining and KnowledgeDiscovery vol 3 no 3 pp 263ndash290 1999



Figure 1: Road network produced via GPS points of taxi trajectories with density distribution: (a) overview of Beijing; (b) the 5th Ring Road of Beijing.

Naturally, taxi trajectory data are becoming an important mobile trajectory data source that is widely utilized by industries, academia, and governments for many practical applications; in particular, clustering the GPS trajectories of taxicabs provides a new idea for traffic subarea division. K-Means is the most commonly used partitional clustering algorithm [12], but it suffers from three major shortcomings [13]: (i) its scalability is poor, (ii) K needs to be specified by the user, and (iii) the search is liable to local minima. Furthermore, K-Means has some bottlenecks in clustering explosively growing taxi trajectories, such as high memory consumption and I/O cost, low performance, and poor reliability. In particular, the execution time of K-Means is proportional to the product of the number of patterns and the number of clusters in each iteration, so the computational cost becomes very high for large data sets. Obviously, the sequential versions of existing K-Means algorithms are not suited to processing large-scale taxi trajectories on a single machine.

Recently, several parallel K-Means algorithms [14–22] have been proposed to meet the rapidly growing demand for clustering big data sets. Meanwhile, some methods [23–31] have been presented for traffic subarea division. All the previous approaches have achieved desirable properties, but they also have limitations: in particular, the capacity of data processing has not been improved substantially, so they might have difficulty dividing traffic subareas with a large number of GPS trajectories of taxicabs. To meet these challenges, this paper focuses on the improvement and parallelization of K-Means, to enhance the accuracy and efficiency of clustering large-scale taxi trajectories and thereby solve the traffic subarea division problem.

In this paper, we put forward a parallel K-Means optimization algorithm (Par3PKM) and implement it in a MapReduce framework on a Hadoop platform. We also divide the traffic subareas of Beijing, using a large number of GPS trajectories of taxicabs, through our distributed traffic subarea division (DTSAD) method. More specifically, the distance metric and the initialization strategy of K-Means are modified in Par3PKM, and the time-consuming iteration is accomplished in the MapReduce model of computation; the accuracy and efficiency of clustering large-scale taxi trajectories are thus significantly improved. On the other hand, to save huge amounts of communication, memory consumption, and I/O cost with MapReduce, DTSAD is performed on a distributed computing platform, Hadoop. In particular, in DTSAD, a boundary identifying method accurately connects the borders of each cluster.

The contributions of this work are summarized as follows

(i) An efficient parallel clustering algorithm (Par3PKM) is proposed to address the problems of the traditional K-Means algorithm in processing massive traffic data. The evaluation results demonstrate that Par3PKM can efficiently cluster a large number of GPS trajectories of taxicabs; in particular, it offers a practical reference for parallelizing algorithms of the same type.

(ii) A new distributed division method (DTSAD) is presented to divide traffic subareas with large-scale taxi trajectories, and a boundary identifying method is put forward to connect the borders of the clustering results. The experimental results indicate that DTSAD can accurately divide traffic subareas based on the proposed Par3PKM algorithm.

(iii) The aforementioned approach is applied to the traffic subarea division of Beijing using big taxi trajectory data on a Hadoop platform with MapReduce and ArcGIS. Case studies show that the proposed approach significantly improves the accuracy and efficiency of traffic subarea division; notably, the division results are consistent with the real traffic conditions of the corresponding areas of Beijing.

The remainder of this paper is organized as follows. Section 2 reviews related work, and the motivation, together with our solution, is given in Section 3. In Section 4, the Par3PKM algorithm is described in detail. Section 5 presents the applications of our approach and analyzes the division results. In Section 6, we evaluate the performance of the proposed algorithm and discuss the experimental results; we conclude the paper in Section 7.

2. Related Work

In this section, we first briefly review related work on traffic subarea division and on parallel K-Means algorithms, and we then give an overview of the MapReduce framework used in this paper.

2.1. Traffic Subarea Division. The concept of a traffic subarea was first proposed by Walinchus [23], and various methods of traffic subarea division were subsequently developed for various ITS applications, including easing traffic congestion. Wong et al. [24] presented a time-dependent TRANSYT (Traffic Network Study Tool) traffic model for area traffic control, and Robertson and Bretherton [25] introduced the SCOOT (Split Cycle Offset Optimization Technique) method to optimize networks of traffic signals in real time. Ma and Yang [26] designed an expert system of traffic subarea division to trade off several demands of managing the traffic network against reducing the complexity of traffic control, based on an integrated correlation index. Lu et al. [27, 28] built a model for partitioning traffic control subareas based on correlation degree analysis and optimized the strategy of subarea division. Guo et al. [29] provided a dynamic traffic control subarea division method according to the similarity of adjacent intersections.

Furthermore, some researchers put forward dimension-reduced processing and genetic algorithms to optimize subarea division. Li et al. [30] proposed a method to divide traffic control subareas dynamically on the basis of a Back Propagation (BP) neural network; this method divides traffic subareas by considering the traffic flow distance of intersections and the cycle. In addition, to improve the performance of large-scale urban traffic network division, Zhou et al. [31] presented a new approach to calculating the correlation degree, which determines the desire for interconnection between two adjacent intersections.

Obviously, all the solutions mentioned above have many desirable properties but may have difficulty in processing a large number of taxi trajectories. In this work, we present a distributed traffic subarea division (DTSAD) method to improve the accuracy and efficiency of division using large-scale GPS trajectories of taxicabs on a Hadoop platform with MapReduce and ArcGIS.

2.2. Parallel K-Means Algorithms. The K-Means algorithm proposed by MacQueen [32] is the most popular clustering method for partitioning a given data set into a user-specified number of clusters (i.e., K) [33–35], and it needs several iterations over the data set before converging on a solution. K-Means often converges to a local optimum; to address this problem, an incremental K-Means (IKM) algorithm [36] was put forward that empirically reaches the global optimum but has a higher complexity. Nevertheless, the traditional K-Means algorithm requires several scans over the data set, which has to be fully loaded into memory to make data access efficient, a requirement that is hard to fulfill for massive data [34]. To overcome this drawback, Two-Phase K-Means [37] was introduced; it can robustly find a good solution in one iteration by employing a buffering technique, but it is possibly unsuited to processing huge amounts of data.

Subsequently, Kantabutra and Couch [14] presented a parallel K-Means algorithm on a Message-Passing Interface (MPI) framework over a network of workstations (NOWs); its main idea is that all data objects of a cluster are stored in one slave node. Data rearrangement in this method needs large-scale data transmission between slaves, which makes it difficult to deal with big data sets. Zhang et al. [15] implemented parallel K-Means on a Parallel Virtual Machine (PVM) framework running on NOWs; this parallel method requires a full load of the data set on the master node and synchronization of data at the end of each iteration. Kraj et al. [16] proposed the parallel ParaKMeans algorithm, which makes good use of multithreading on a single machine to accomplish clustering and employs sufficient statistics to measure cluster quality in the stop condition. Pakhira [17] introduced a distributed K-Means algorithm executed on multiprocessor machines by randomly delivering split data subsets to each processor. Kohlhoff et al. [18] developed parallel K-Means (Kps-Means) with a GPU implementation, which is efficient regardless of the dimensionality of the input data. Because the communication between different slaves takes up a large number of I/O resources and consumes huge amounts of time, these methods might have difficulty in dealing with large-scale data.

Moreover, to cluster big data effectively, some researchers implemented parallel versions of K-Means on a MapReduce framework. Chu et al. [19] developed a pragmatic and general framework that applies the MapReduce parallel programming method to a variety of learning algorithms, including K-Means. Zhao et al. [20] put forward the PKMeans algorithm using MapReduce, and Zhou et al. [21] implemented automatic classification of a large number of documents with PKMeans. Nguyen et al. [22] introduced a parallel Two-Phase K-Means algorithm (Par2PK-Means) to overcome the limitations of available parallel versions; it achieves a better speedup than the sequential version.

To the best of our knowledge, none of the aforementioned efforts have been successfully applied to solving the traffic subarea division problem with large-scale taxi trajectories. In this work, we therefore propose a MapReduce-based parallel clustering algorithm (Par3PKM) for clustering a large number of GPS trajectories of taxicabs in parallel. Differently from existing methods, in Par3PKM we modify the distance metric and initialization strategy of K-Means to improve the accuracy of clustering, and we implement the parallel computation of the iteration in three phases to enhance the efficiency of computation. Furthermore, a Combiner function is added to cut down the amount of data shuffled between the Map tasks and the Reduce tasks, thereby reducing the computational complexity of the MapReduce job and saving the limited bandwidth available on a Hadoop cluster.

Figure 2: Architectures of HDFS and MapReduce.

2.3. MapReduce Framework. The Hadoop Distributed File System (HDFS) [38] and Hadoop MapReduce [39, 40] are the two core components of Hadoop [41–43], based on the open-source implementations of GFS [44] and MapReduce [45]; their architectures are depicted in Figure 2. For further details on Hadoop, see the Apache Hadoop website (http://hadoop.apache.org).

MapReduce is a parallel processing paradigm that allows for massive scalability across hundreds or thousands of servers in a Hadoop cluster [46]; in particular, it provides an efficient computing framework for dealing with big taxi trajectory data for traffic subarea division. As a typical methodology, the processing of a MapReduce job includes a Map phase and a Reduce phase. Each phase has key-value pairs as input and output, the types of which may be chosen by the programmer, who specifies the Map function and the Reduce function [40]. A simple example is illustrated in Figure 3, which shows the logical data flow of a simple MapReduce job that calculates the maximum temperature for each year by mining huge amounts of weather data.
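As a concrete illustration, the per-year maximum-temperature job of Figure 3 can be imitated in a few lines of single-process Python. This is only a sketch: the record format and values below are invented for illustration, and the two functions stand in for Hadoop's distributed Map, shuffle/sort, and Reduce stages.

```python
from collections import defaultdict

def map_phase(records):
    """Map: parse each raw record into a (year, temperature) pair."""
    for line in records:
        year, temp = line.split(",")
        yield year, int(temp)

def reduce_phase(pairs):
    """Group pairs by key (the shuffle/sort step), then keep the max per year."""
    groups = defaultdict(list)
    for year, temp in pairs:
        groups[year].append(temp)
    return {year: max(temps) for year, temps in groups.items()}

records = ["2013,12", "2013,46", "2014,-11", "2014,48"]
print(reduce_phase(map_phase(records)))  # {'2013': 46, '2014': 48}
```

On a real cluster, the grouping inside `reduce_phase` is performed by the framework between the Map and Reduce tasks, not by user code.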

In this work, based on the MapReduce framework, the time-consuming iterations of the proposed Par3PKM algorithm are performed in three phases with the Map function, the Combiner function, and the Reduce function; the parallel computing process of MapReduce is shown in Figure 4. Specifically, in Par3PKM the incremental Combiner function is executed between the Map tasks and the Reduce tasks, which reduces the computational complexity of MapReduce jobs and saves the limited bandwidth available on a Hadoop cluster.

3. Motivation

In this section, we describe the motivation of this work and give a reasonable solution to the traffic subarea division problem.

Naturally, GPS-equipped taxis are an essential public traffic tool in modern cities. Over the last decade, taxi trajectory data have been exploding and have become the most important mobile trajectory data source, with the advantages of broad coverage, extra precision, excellent continuity, little privacy concern, and so forth. Thus, for solving the traffic subarea division problem, how to substantially improve the capacity of processing large-scale GPS trajectories of taxicabs poses an urgent challenge. On the other hand, as one of the best-known clustering techniques in practice, the traditional K-Means algorithm is a simple iterative approach, but it has very high time complexity and memory consumption. The time requirements of K-Means are basically linear in the number of data objects: the time required is O(K * I * n * m), where K ≤ n, I ≤ n, K denotes the number of desired clusters, I is the number of iterations required for convergence, n is the total number of objects, and m represents the number of attributes in the given data set. In particular, the storage required is O((K + n) * m). Obviously, the efficiency of the serial K-Means algorithm is low when handling a large number of taxi trajectories with limited memory on a single machine. Hence, to enhance the accuracy and efficiency of clustering, another challenge is how to optimize K-Means and implement it in a MapReduce framework on a Hadoop platform.
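The O(K * I * n * m) cost can be seen directly in a sketch of one sequential K-Means (Lloyd) iteration. This toy implementation is ours, for illustration only, and is not the paper's code:

```python
def kmeans_iteration(points, centroids):
    """One sequential K-Means (Lloyd) iteration.

    The nested loops below touch n objects * K centers * m attributes,
    which is where the O(K * I * n * m) total over I iterations comes from.
    """
    K, m = len(centroids), len(centroids[0])
    assignments = []
    for p in points:                                        # n objects
        dists = [sum((p[j] - c[j]) ** 2 for j in range(m))  # m attributes
                 for c in centroids]                        # K centers
        assignments.append(dists.index(min(dists)))
    new_centroids = []
    for k in range(K):                                      # recompute each mean
        members = [p for p, a in zip(points, assignments) if a == k]
        new_centroids.append(
            [sum(col) / len(members) for col in zip(*members)]
            if members else centroids[k])                   # keep an empty cluster's center
    return assignments, new_centroids

pts = [[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]]
labels, centers = kmeans_iteration(pts, [[0.0, 0.0], [5.0, 5.0]])
print(labels)  # [0, 0, 1, 1]
```

Storing `points` (n * m values) plus `centroids` (K * m values) on one machine is what yields the O((K + n) * m) memory footprint noted above.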

These challenges motivate the development of the Par3PKM algorithm and the DTSAD method; our solution to the above problems is illustrated in Figure 5. As shown in Figure 5, based on a Hadoop platform with MapReduce and ArcGIS, the DTSAD process mainly includes the following steps. First, we preprocess the correlation data extracted from large-scale taxi trajectories. Then, we cluster huge amounts of trajectory data in parallel using the proposed Par3PKM algorithm, as described in Section 4. Finally, we identify the borders of the clustering results to build each traffic subarea with our boundary identifying method, as depicted in Section 5.3.

Clearly, the key to this solution is the accuracy and efficiency of clustering large-scale taxi trajectories, which determines the overall performance of traffic subarea division. Thus, the aim of this paper is to put forward an efficient parallel clustering algorithm (Par3PKM) with a MapReduce implementation.

4. The Proposed Par3PKM Algorithm

In this section, the Parallel Three-Phase K-Means (Par3PKM) algorithm is proposed for efficiently and accurately clustering a large number of GPS trajectories of taxicabs under a MapReduce framework on Hadoop.

Figure 3: Logical data flow of a simple MapReduce job.

Figure 4: Parallel computing process of MapReduce.

Figure 5: Process and framework of the DTSAD method.

Figure 6: Process of the Par3PKM algorithm.

Figure 7: Parallel execution process of iteration.

4.1. Overview and Notation. The process of the Par3PKM algorithm is depicted in Figure 6. Based on a MapReduce framework, the Par3PKM algorithm first chooses K of the objects as the initial centroids, where K is the number of desired clusters. Then each of the remaining objects is assigned to the cluster to which it is most similar, on the basis of the distance between the object and the cluster center. Finally, it computes the new center of each cluster, and this process iterates until the criterion function converges. The parallel execution of the iteration is illustrated in Figure 7, and its MapReduce implementation will be described in detail in Section 4.3.

Typically, the squared-error criterion [33] is used to measure the quality of clustering. Let \(X = \{x_i \mid i = 1, \ldots, n\}\) be the set of \(n\) \(m\)-dimensional vectors to be clustered into a set of \(K\) clusters \(C = \{c_k \mid k = 1, \ldots, K\}\), and let \(\mu_k\) be the mean of cluster \(c_k\).

The squared-error criterion between \(\mu_k\) and the given objects in cluster \(c_k\) is defined as

\[ J(c_k) = \sum_{x_i \in c_k} \left\| x_i - \mu_k \right\|^2. \quad (1) \]

The goal of the Par3PKM algorithm is to minimize the sum of the squared errors (SSE) over all \(K\) clusters, where SSE is given by

\[ \mathrm{SSE} = J(C) = \sum_{k=1}^{K} \sum_{x_i \in c_k} \left\| x_i - \mu_k \right\|^2. \quad (2) \]
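For reference, the SSE of (2) can be computed with a short sketch; the helper below is our own illustration, with the cluster mean recomputed from the members:

```python
def sse(clusters):
    """Sum of squared errors, Eq. (2): for every cluster, add the squared
    Euclidean distance of each member to the cluster mean (Eq. (1))."""
    total = 0.0
    for members in clusters:
        m = len(members[0])
        mu = [sum(x[j] for x in members) / len(members) for j in range(m)]
        total += sum(sum((x[j] - mu[j]) ** 2 for j in range(m)) for x in members)
    return total

# two 1-dimensional clusters with means 1.0 and 11.0
print(sse([[[0.0], [2.0]], [[10.0], [12.0]]]))  # 4.0
```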

4.2. Distance Measure and Cluster Initialization. It is well known that the K-Means algorithm is sensitive to the distance measure and the cluster initialization. To overcome these critical limitations, we optimize the distance metric and the initialization strategy to improve the accuracy and efficiency of clustering in the proposed Par3PKM algorithm.

4.2.1. Distance Metric. The traditional K-Means algorithm typically employs the Euclidean metric to compute the distance between objects and cluster centers. In the proposed Par3PKM algorithm, we instead adopt two rules based on a statistical approach [47] for distance measure selection, which is more appropriate for large-scale trajectory data sets.

The two rules of distance measure selection are as follows:

(i) If \(\kappa \le 846\), the squared Euclidean distance (see (5)) is chosen as the distance measure of Par3PKM.

(ii) If \(\kappa > 846\), the Manhattan distance (see (6)) is selected as the distance measure of Par3PKM.

Here, \(\kappa\) is the kurtosis, which measures tail heaviness. It is defined as

\[ \kappa = \frac{(1/n) \sum_{i=1}^{n} (x_i - \bar{X})^4}{\sigma^4}, \quad (3) \]

where \(n\) represents the sequence length, and the sample mean \(\bar{X}\) and the sample standard deviation \(\sigma\) are given by

\[ \bar{X} = \frac{1}{n} \sum_{i=1}^{n} x_i, \qquad \sigma = \left( \frac{1}{n-1} \sum_{i=1}^{n} \left( x_i - \bar{X} \right)^2 \right)^{1/2}. \quad (4) \]

The squared Euclidean distance \(d_e^2\) and the Manhattan distance \(d_m\) are, respectively, given by

\[ d_e^2 \left( x_i, \mu_k \right) = \sum_{j=1}^{m} \left| x_{ij} - \mu_{kj} \right|^2, \quad (5) \]

\[ d_m \left( x_i, \mu_k \right) = \sum_{j=1}^{m} \left| x_{ij} - \mu_{kj} \right|, \quad (6) \]

where \(x_i = (x_{i1}, x_{i2}, \ldots, x_{im})\) and \(\mu_k = (\mu_{k1}, \mu_{k2}, \ldots, \mu_{km})\) are two \(m\)-dimensional data objects.

The Par3PKM algorithm achieves more accurate clustering performance than the K-Means algorithm via this statistics-based distance measure method, as will be demonstrated in Section 6.
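The two selection rules of Section 4.2.1 might be sketched as follows. The helper names and the parameterized threshold are our own choices, with the threshold defaulting to the value printed in rule (i):

```python
def kurtosis(xs):
    """Kurtosis per Eq. (3), using the sample mean and the
    sample standard deviation of Eq. (4)."""
    n = len(xs)
    mean = sum(xs) / n
    sigma = (sum((x - mean) ** 2 for x in xs) / (n - 1)) ** 0.5
    return (sum((x - mean) ** 4 for x in xs) / n) / sigma ** 4

def choose_distance(sequence, threshold=846):
    """Rule (i)/(ii): squared Euclidean when kappa <= threshold,
    Manhattan otherwise."""
    def sq_euclidean(x, mu):
        return sum((a - b) ** 2 for a, b in zip(x, mu))
    def manhattan(x, mu):
        return sum(abs(a - b) for a, b in zip(x, mu))
    return sq_euclidean if kurtosis(sequence) <= threshold else manhattan

metric = choose_distance([1.0, 2.0, 3.0, 4.0, 5.0])  # light tails -> squared Euclidean
print(metric([0, 0], [1, 1]))  # 2
```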

4.2.2. Initialization Strategy. Different initialization strategies lead to different efficiencies. In our Par3PKM, three initialization strategies are developed to enhance the efficiency of clustering. To overcome the local minima of Par3PKM, for a given K we run several different initial partitions and select the partition with the smallest value of the squared error. Furthermore, inspired by [48], we take the mutually farthest data objects in the high-density area as the initial centroids of Par3PKM; the method of obtaining the initial centroids from the high-density area was introduced in [49]. In addition, both pre- and postprocessing of Par3PKM are taken into consideration: for example, we remove outliers in a preprocessing step, and we eliminate small clusters and/or merge close clusters into a large cluster when postprocessing the results.

From the experimental evaluations described in Section 6, we observe that the implemented initialization strategies shorten the execution time of Par3PKM while yielding the same clustering results.
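One plausible reading of the density-based seeding, choosing mutually far objects drawn from high-density regions, is sketched below. The neighborhood radius and the simple neighbor-count density are our own simplifications, not the exact procedure of [48, 49]:

```python
def density_farthest_init(points, K, radius=1.0):
    """Seed K centroids: start from the densest object, then repeatedly add the
    candidate farthest (in min-distance terms) from the centroids chosen so far."""
    def d2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    # crude density estimate: number of neighbours within `radius` of each object
    dens = [sum(d2(p, q) <= radius ** 2 for q in points) for p in points]
    # candidate pool: the densest objects (at least K of them)
    pool = sorted(range(len(points)), key=lambda i: -dens[i])[:max(3 * K, K)]
    chosen = [points[pool[0]]]
    while len(chosen) < K:
        far = max(pool, key=lambda i: min(d2(points[i], c) for c in chosen))
        chosen.append(points[far])
    return chosen

print(density_farthest_init([[0, 0], [0.1, 0], [5, 5], [5, 5.1]], K=2))
# [[0, 0], [5, 5.1]]
```

Restricting the farthest-point search to the dense pool keeps isolated outliers from being picked as seeds, matching the spirit of the outlier-removal preprocessing mentioned above.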

4.3. MapReduce Implementation. To improve the efficiency and scalability of clustering, we implement the Par3PKM algorithm in the MapReduce model of computation. The tasks of Par3PKM mainly consist of the following three aspects:

(i) Compute the distance between each object and each cluster center.

(ii) Assign each object to its closest centroid.

(iii) Recompute the new cluster center of each cluster.

In this work, we accomplish the aforementioned tasks on a Hadoop platform using MapReduce. First, we select K data objects (vectors) as the initial cluster centers from the input data sets, which are stored in HDFS. Then, we update the centroid of each cluster after each iteration. As depicted in Figure 7, the parallel execution of the iteration is implemented by the Map function, the Combiner function, and the Reduce function in three phases (i.e., the Map phase, the Combine phase, and the Reduce phase), respectively.

4.3.1. Map Phase. In this phase, each Map task receives a line of the sequence file as a key-value pair, which forms the input to the Map function. The Map function first computes the distance between each data object and each cluster center, then assigns each object to its closest centroid according to the shortest distance, and finally outputs the intermediate data to the Combiner function. The Map function is formally depicted in Algorithm 1.

4.3.2. Combine Phase. In this phase, the Combiner function first extracts all the data objects from value1, which is the output of the Map function, and merges the data objects belonging to the same cluster center. Next, it calculates the sum of the values of the data objects assigned to the same cluster and records the number of samples in that cluster, so as to compute the mean value of the data objects. Finally, it outputs the local results of each cluster to the Reduce function. The Combiner function is formally described in Algorithm 2.

4.3.3. Reduce Phase. In this phase, the Reduce function extracts all the data objects from value2, which is the output of the Combiner function, and aggregates the local results of all the clusters. Then it computes the new cluster center of each cluster. After that, it judges whether the criterion function has converged; it outputs the final results if this is true and executes the next iteration otherwise. The Reduce function is formally illustrated in Algorithm 3.
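The three phases above can be simulated in a single process as follows. The function names and data layout are our own illustrative choices; in the actual Par3PKM, these would run as Hadoop Map, Combiner, and Reduce tasks over HDFS data:

```python
from collections import defaultdict

def d2(x, mu):
    return sum((a - b) ** 2 for a, b in zip(x, mu))

def map_phase(objects, centroids):
    """Map: emit (index of the closest centroid, object) for every object."""
    for x in objects:
        k = min(range(len(centroids)), key=lambda i: d2(x, centroids[i]))
        yield k, x

def combine_phase(mapped):
    """Combine: per cluster, keep only a partial vector sum and a sample count,
    so far less data is shuffled to the reducers."""
    partial = defaultdict(lambda: (None, 0))
    for k, x in mapped:
        s, c = partial[k]
        partial[k] = ([a + b for a, b in zip(s, x)] if s else list(x), c + 1)
    return dict(partial)

def reduce_phase(partials, old_centroids):
    """Reduce: aggregate the partial sums and counts into new cluster centers."""
    new = list(old_centroids)
    for k, (s, c) in partials.items():
        new[k] = [v / c for v in s]
    return new

objs = [[0, 0], [0, 2], [10, 10], [10, 12]]
cents = [[0.0, 0.0], [10.0, 10.0]]
print(reduce_phase(combine_phase(map_phase(objs, cents)), cents))
# [[0.0, 1.0], [10.0, 11.0]]
```

Note that the combiner forwards only one (sum, count) pair per cluster, regardless of how many objects its Map task processed, which is exactly the bandwidth saving discussed in Section 2.3.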

4.4. Complexity Analysis. MapReduce is a programming model for large-scale data processing, and MapReduce programs are inherently parallel. Thus, the Par3PKM algorithm with its MapReduce implementation distributes large numbers of computational tasks across different nodes. According to this parallel processing paradigm, the time complexity of Par3PKM is O(K * I * n * m)/(p * q), where p is the number of nodes, q is the number of Map tasks per node, and K, I, n, and m are explained in Section 3. Moreover, the space complexity of Par3PKM is O((K + n) * m)/p, since only a share of the data objects and the centroids are stored in each slave node; thus the space requirements of Par3PKM are modest.

In comparison with the traditional K-Means algorithm, the Par3PKM algorithm improves the efficiency of clustering according to the following rate equation:

O_imp = (1 − 1/(p · q)) × 100%. (7)
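As a quick sanity check of equation (7): with a hypothetical configuration of p = 8 nodes and q = 2 Map tasks per node (values chosen for illustration, not taken from the paper's cluster), the predicted efficiency improvement is (1 − 1/16) × 100% = 93.75%:

```python
def efficiency_improvement(p, q):
    """Rate equation (7): predicted efficiency gain of Par3PKM over
    serial K-Means, given p nodes and q Map tasks per node."""
    return (1 - 1 / (p * q)) * 100

# Hypothetical cluster: 8 nodes, 2 Map tasks each.
print(efficiency_improvement(8, 2))  # → 93.75
```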

Based on the above analyses, we can conclude that the Par3PKM algorithm is relatively scalable and efficient in clustering large-scale data sets, with the desired computational complexity.

5. Case Study

In this section, we apply the proposed approach to divide the traffic subareas of Beijing using large-scale taxi trajectories and then analyze the results.

5.1. Data Sets and Framework. In this work, we divide the traffic subareas of Beijing based on real-world trajectory data sets (http://www.datatang.com), which contain a large number of GPS trajectories recorded by 12,000 taxicabs over a period of 30 days in November 2012. The total distance covered in the data sets is more than 50 million kilometers, and the total size is 50 GB. In particular, the total number of GPS points reaches 969 million.

Additionally, we perform the case study on a Hadoop cluster with MapReduce (as described in Section 6.1) and on ArcGIS with the road network of Beijing, which consists of 106,579 road nodes and 141,380 road segments.

Discrete Dynamics in Nature and Society

Input: key: the offset; value: the sample; centroids: the global variable
Output: ⟨key1, value1⟩; key1: the index of the closest centroid; value1: the information of the sample

(1) Construct a global variable centroids including the information of the closest centroid point
(2) Construct the sample examples to extract the data objects from value
(3) min_Dis = Double.MAX_VALUE
(4) index = −1
(5) for i = 0 to centroids.length do
(6)   distance = DistanceFunction(examples, centroids[i])
(7)   if distance < min_Dis then
(8)     min_Dis = distance
(9)     index = i
(10)  end if
(11) end for
(12) Take index as key1
(13) Construct value1 as a string consisting of the values of different dimensions
(14) return ⟨key1, value1⟩ pairs

Algorithm 1: Map(key, value).

Input: key1: the index of the cluster; medi: the list of the samples assigned to the same cluster
Output: ⟨key2, value2⟩; key2: the index of the cluster; value2: the sum of the values of the samples belonging to the same cluster, and the number of samples

(1) Construct a counter num_s to record the number of samples in the same cluster
(2) Construct an array sum_v to record the sum of the values of different dimensions of the samples belonging to the same cluster (i.e., the samples in the list medi)
(3) Construct the sample examples to extract the data objects from medi.next() and the dimensions to obtain the dimension of the original data object
(4) num_s = 0
(5) while (medi.hasNext()) do
(6)   CurrentPoint = medi.next()
(7)   num_s++
(8)   for i = 0 to dimensions do
(9)     sum_v[i] += CurrentPoint.point[i]
(10)    Calculate the sum of the values of each dimension of examples
(11)  end for
(12)  for i = 0 to dimensions do
(13)    mean[i] = sum_v[i]/num_s
(14)    Compute the mean value of the samples for each cluster
(15)  end for
(16) end while
(17) Take the cluster index as key2
(18) Construct value2 as a string containing the sum of the values of each dimension sum_v[i] and the number of samples num_s
(19) return ⟨key2, value2⟩ pairs

Algorithm 2: Combiner(key1, medi).


Input: key2: the index of the cluster; medi: the list of the local sums from different clusters
Output: ⟨key3, value3⟩; key3: the index of the cluster; value3: the new cluster center

(1) Construct a counter Num to record the total number of samples belonging to the same cluster
(2) Construct an array sum_v to record the sum of the values of different dimensions of the samples in the same cluster (i.e., the samples in the list medi)
(3) Construct the sample examples to extract the data objects from medi.next() and the dimensions to obtain the dimension of the original data object
(4) Num = 0
(5) while (medi.hasNext()) do
(6)   CurrentPoint = medi.next()
(7)   Num += num_s
(8)   for i = 0 to dimensions do
(9)     sum_v[i] += CurrentPoint.point[i]
(10)  end for
(11)  for i = 0 to dimensions do
(12)    mean[i] = sum_v[i]/Num
(13)    Obtain the new cluster center
(14)  end for
(15) end while
(16) Take the cluster index as key3
(17) Construct value3 as a string composed of the new cluster center
(18) return ⟨key3, value3⟩ pairs

Algorithm 3: Reduce(key2, medi).
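The dataflow of Algorithms 1–3 can be condensed into a serial Python sketch of one Par3PKM iteration (an illustration only, not Hadoop code; the Euclidean distance and the toy point layout are our assumptions):

```python
import math
from collections import defaultdict

def map_phase(points, centroids):
    """Algorithm 1: emit (index of the closest centroid, point) pairs."""
    pairs = []
    for p in points:
        dists = [math.dist(p, c) for c in centroids]
        pairs.append((dists.index(min(dists)), p))
    return pairs

def combine_phase(pairs):
    """Algorithm 2: per cluster, the local coordinate sums and sample count."""
    local = defaultdict(lambda: (None, 0))
    for idx, p in pairs:
        s, n = local[idx]
        local[idx] = ([a + b for a, b in zip(s, p)] if s else list(p), n + 1)
    return dict(local)

def reduce_phase(partials):
    """Algorithm 3: merge the combiners' partial sums, emit new centers."""
    total = defaultdict(lambda: (None, 0))
    for part in partials:
        for idx, (s, n) in part.items():
            ts, tn = total[idx]
            total[idx] = ([a + b for a, b in zip(ts, s)] if ts else list(s), tn + n)
    return {idx: [v / n for v in s] for idx, (s, n) in total.items()}

# Two input splits, two initial centroids (toy data).
cents = [(0, 0), (10, 10)]
part1 = combine_phase(map_phase([(0, 0), (0, 1)], cents))
part2 = combine_phase(map_phase([(10, 10), (10, 11)], cents))
print(reduce_phase([part1, part2]))  # → {0: [0.0, 0.5], 1: [10.0, 10.5]}
```

Note that each combiner only ships one (sum, count) pair per cluster to the reducer, which is exactly how the Combine phase cuts shuffle traffic.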

Figure 8: Clustering results of large-scale taxi trajectories.

5.2. Parallel Clustering. After data preprocessing with the DTSAD method, we extract the relevant attributes (e.g., longitude, latitude) of the GPS trajectory records where passengers are picked up or dropped off from the aforementioned data sets. Then, on a Hadoop cluster with MapReduce, we cluster the extracted trajectory data sets through the Par3PKM algorithm (as depicted in Section 4) with K = 100, the number of desired clusters. Finally, based on the ArcGIS platform with the road network of Beijing, we plot the results in Figure 8.
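The extraction step can be sketched as follows (a minimal illustration; the record layout `(timestamp, lon, lat, occupied)` and the rule that a change in the occupancy flag marks a pick-up or drop-off event are our assumptions about the preprocessed data, not the paper's exact schema):

```python
def extract_events(records):
    """Return the pick-up/drop-off coordinates from one taxi's
    time-ordered GPS records, where each record is
    (timestamp, lon, lat, occupied) with occupied in {0, 1}."""
    events, prev = [], None
    for _, lon, lat, occ in records:
        if prev is not None and occ != prev:  # 0->1 pick-up, 1->0 drop-off
            events.append((lon, lat))
        prev = occ
    return events

trace = [(1, 116.30, 39.98, 0), (2, 116.31, 39.99, 1),
         (3, 116.35, 40.00, 1), (4, 116.36, 40.01, 0)]
print(extract_events(trace))  # → [(116.31, 39.99), (116.36, 40.01)]
```

The resulting (lon, lat) points are what Par3PKM clusters.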

As illustrated in Figure 8, the large-scale taxi trajectories are clustered into different areas, and each area with a different color represents a cluster. Each cluster (e.g., Area A) has obvious traffic-condition characteristics: for example, the flow of people and automobiles in these areas is high, in agreement with the real traffic map and traffic conditions of Beijing.

Figure 9: Process of the boundary identifying method: (a) division of the coordinate system, (b) selection of border points, and (c) connection of border points.

5.3. Boundary Identifying. On the ArcGIS platform, it is difficult to identify the borders of each cluster directly. However, the borders of each cluster must be connected in order to accurately form traffic subareas, which we do via our boundary identifying method, described in Figure 9. As illustrated in Figure 9, the method is mainly composed of the following three steps.

Step 1. As shown in Figure 9(a), we build a coordinate system that is equally divided into n parts, taking (0, 0) as the origin of coordinates.

Step 2. We match each cluster center to the origin of coordinates and then map the other points of the same cluster into the coordinate system. Finally, the farthest point in each part is selected (e.g., P in Figure 9(b)).

Step 3. As depicted in Figure 9(c), we connect the selected points of all the parts and thereby obtain a subarea.
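The three steps can be sketched in Python (a minimal version under our own assumptions: planar Euclidean geometry, n equal angular sectors, and one farthest point kept per sector; the sector count n is a tuning choice not specified in the text):

```python
import math

def subarea_boundary(center, points, n=36):
    """Steps 1-3: translate the cluster so its center is the origin,
    split the plane into n equal angular sectors, keep the farthest
    point in each sector, and return those points ordered by angle."""
    cx, cy = center
    farthest = {}
    for x, y in points:
        dx, dy = x - cx, y - cy
        sector = int(((math.atan2(dy, dx) + 2 * math.pi) % (2 * math.pi))
                     / (2 * math.pi / n))
        r = math.hypot(dx, dy)
        if sector not in farthest or r > farthest[sector][0]:
            farthest[sector] = (r, (x, y))
    return [p for _, (_, p) in sorted(farthest.items())]

# Toy cluster: four corner points plus one interior point (dropped).
print(subarea_boundary((0, 0), [(1, 1), (-1, 1), (-1, -1), (1, -1), (0.2, 0.2)],
                       n=4))  # → [(1, 1), (-1, 1), (-1, -1), (1, -1)]
```

Connecting the returned points in order yields the subarea polygon of Figure 9(c).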

Using the presented boundary identifying method with the clustering results of Par3PKM, we plot the division results of the traffic subareas in Figure 10. As described in Figure 10, Area A, shown in the lower right corner of the graph, is a typical traffic subarea, which includes Tsinghua University, Peking University, Renmin University of China, Beihang University, and Zhongguancun. Area B is composed of the Beijing Workers' Gymnasium, Blue Zoo Beijing, Children Outdoor Paopao Paradise, and so forth. Area C consists of Beijing North Railway Station, Beijing Exhibition Center, Capital Indoor Stadium, and so forth. Area D contains the Olympic Sports Center Stadium, Beijing Olympic Park, the National Indoor Stadium, the Beijing National Aquatics Center, the Beijing International Convention Center, and so forth.

5.4. Analysis of Results. From the division results, we can observe that areas with similar traffic conditions are divided into the same traffic subarea, such as Tsinghua University and Peking University, or the Olympic Sports Center Stadium and the National Indoor Stadium. In contrast, Blue Zoo Beijing and the Beijing Exhibition Center, as well as Beijing North Railway Station and the Beijing International Convention Center, are classified into different traffic subareas. This is because the different regions of a traffic subarea have great similarities and correlations in traffic condition, business pattern, and other aspects.

Based on the above analysis, we conclude that the proposed Par3PKM algorithm can efficiently cluster big trajectory data on a Hadoop cluster using MapReduce. Moreover, our boundary identifying method can accurately connect the borders of the clustering results for each cluster. In particular, the division results are consistent with the real traffic conditions of the corresponding areas in Beijing. Overall, the results demonstrate that Par3PKM combined with DTSAD is a promising alternative for traffic subarea division with large-scale taxi trajectories and thus can reduce the complexity of traffic planning, management, and analysis. More importantly, it can provide helpful decision-making support for building ITSs.

6. Evaluation and Discussion

In this section, we evaluate the accuracy, efficiency, speedup, scale-up, and reliability of the Par3PKM algorithm via extensive experiments on real and synthetic data, and then discuss the performance results.

6.1. Evaluation Setup. The experimental platform is based on a Hadoop cluster composed of one master machine and eight slave machines, each with an Intel Xeon E7-4820 2.00 GHz CPU (4-core) and 8.00 GB RAM. All the experiments are performed on Ubuntu 12.04 OS with Hadoop 1.0.4 and JDK 1.6.0.


Figure 10: Division results of traffic subarea.

Table 1: Data sets of the experimental evaluations.

Name                  Number of instances   Number of attributes
Iris                  150                   4
Haberman's Survival   306                   3
Ecoli                 336                   8
Hayes-Roth            160                   5
Lenses                24                    4
Wine                  178                   13

In addition to a real taxi trajectory data set (as described in Section 5.1), we use six synthetic data sets (as shown in Table 1), selected from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets.html), to evaluate the performance of the Par3PKM algorithm in comparison with K-Means, Par2PK-Means, and ParCLARA, a MapReduce implementation of the best-known K-medoids algorithm CLARA (Clustering Large Applications) [50]. Meanwhile, each data set is processed into 80 MB, 160 MB, 320 MB, 640 MB, 1280 MB, and 2560 MB versions so as to further verify the efficiency of the proposed algorithm. Also, we prepare the seven data sets in 160 MB, 320 MB, 480 MB, 640 MB, 800 MB, 960 MB, 1120 MB, and 1280 MB versions for validating the scale-up of our algorithm.

6.2. Evaluation on Efficiency. We perform efficiency experiments where we execute Par3PKM, Par2PK-Means, and ParCLARA in the parallel environment with eight nodes, and K-Means in the single-machine environment, using seven data sets with different sizes (varying from 80 MB to 2560 MB), in order to demonstrate whether the Par3PKM algorithm can process larger data sets with higher efficiency. The experimental results are shown in Table 2 and Figure 11, respectively.

As depicted in Figure 11, the K-Means algorithm cannot process data sets over 1280 MB in the single-machine environment on account of memory overflow, and thus the graphs do not present the corresponding execution time of K-Means in this case. The Par3PKM algorithm, however, can effectively handle data sets of more than 1280 MB, and even larger data, in the parallel environment (i.e., in a "Big Data" environment). In particular, the Par3PKM algorithm achieves higher efficiency than the ParCLARA and Par2PK-Means algorithms thanks to its improvement and parallelization of K-Means, such as the addition of the Combiner function in the Combine phase. To reduce the computational cost of a MapReduce job and save the limited bandwidth available on a Hadoop cluster, the Combiner function of Par3PKM is employed to cut down the amount of data shuffled between the Map tasks and the Reduce tasks; it runs on the Map output, and its output forms the input to the Reduce function.

At the same time, we can see that the execution time of the K-Means algorithm is shorter than that of the Par3PKM algorithm when clustering small-scale data sets. The reason is that the communication and interaction of the nodes consume a certain amount of time in the parallel environment (e.g., the start of Job and Task processes and the communication between the NameNode and the DataNodes), thereby making the execution time much longer than the actual computation time of the Par3PKM algorithm. More importantly, we can also observe that the efficiency advantage of the Par3PKM algorithm grows multiplicatively, and its superiority becomes more marked, as the size of the data sets increases.


Table 2: Execution time comparison on seven data sets.

Data sets            Size (MB)   Execution time (s)
                                 K-Means   ParCLARA   Par2PK-Means   Par3PKM
Taxi Trajectory      80          75        662        439            312
                     160         103       756        486            371
                     320         195       893        579            406
                     640         1125      1173       838            614
                     1280        2373      1679       1348           1026
                     2560        —         2721       2312           1879
Iris                 80          32        612        301            213
                     160         54        693        380            279
                     320         116       812        440            306
                     640         685       1045       630            403
                     1280        1800      1248       1005           768
                     2560        —         2463       2013           1420
Haberman's Survival  80          56        576        311            296
                     160         60        675        400            324
                     320         130       823        470            378
                     640         720       987        670            426
                     1280        2010      1321       1200           873
                     2560        —         2449       2200           1719
Ecoli                80          38        628        330            283
                     160         66        712        400            324
                     320         130       835        460            375
                     640         756       1104       700            482
                     1280        1912      1636       1234           763
                     2560        —         2479       2312           1416
Hayes-Roth           80          41        568        310            278
                     160         57        643        395            347
                     320         125       726        460            387
                     640         715       973        660            438
                     1280        1980      1479       1211           736
                     2560        —         2423       2120           1567
Lenses               80          32        624        309            297
                     160         59        701        389            327
                     320         130       924        452            376
                     640         700       1072       545            432
                     1280        1895      1378       1085           814
                     2560        —         2379       2089           1473
Wine                 80          50        635        350            317
                     160         78        705        420            356
                     320         130       835        470            402
                     640         730       1006       610            426
                     1280        2100      1346       1245           843
                     2560        —         2463       2240           1645

6.3. Evaluation on Accuracy. To evaluate the accuracy, we cluster the different data sets via Par3PKM, Par2PK-Means, and ParCLARA on a Hadoop cluster with eight nodes, and via K-Means in the single-machine environment, respectively. We then plot the results in Figure 12(a).

The quality of the clustering is evaluated via the following error rate (ER) [51] equation:

ER = (O_m / O_t) × 100%, (8)


Figure 11: Efficiency comparison on different data sets: (a) Taxi Trajectory, (b) Iris, (c) Haberman's Survival, (d) Ecoli, (e) Hayes-Roth, (f) Lenses, and (g) Wine.

Figure 12: Accuracy and reliability: (a) accuracy comparison of Par3PKM, Par2PK-Means, and ParCLARA on different data sets and (b) reliability of Par3PKM.

where O_m is the number of misclassified objects and O_t is the total number of objects. The lower the ER, the better the clustering.
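For instance (numbers chosen for illustration only, not experimental results), if 25 of 100 objects were misclassified, equation (8) gives ER = 25%:

```python
def error_rate(misclassified, total):
    """Equation (8): ER = O_m / O_t × 100, in percent."""
    return misclassified / total * 100

print(error_rate(25, 100))  # → 25.0
```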

As illustrated in Figure 12(a), in comparison with the other algorithms, the Par3PKM algorithm produces more accurate clustering results in most cases. The results indicate that the Par3PKM algorithm is valid and feasible.

6.4. Evaluation on Speedup. In order to evaluate the speedup of the Par3PKM algorithm, we keep the seven data sets constant (1280 MB) and increase the number of nodes (ranging from 1 node to 8 nodes) on a Hadoop cluster; the results are plotted in Figure 13(a). Moreover, we utilize the Iris data set (1280 MB) to further verify the speedup of Par3PKM in comparison with Par2PK-Means and ParCLARA; the results are illustrated in Figure 13(b).

The speedup metric [52, 53] is defined as

Speedup = T_s / T_p, (9)

where T_s represents the execution time of an algorithm on one node (i.e., the sequential execution time) for clustering objects using the given data sets, and T_p denotes the execution time of the algorithm for solving the same problem using the same data sets on a Hadoop cluster with p nodes (i.e., the parallel execution time).

Figure 13: Speedup: (a) speedup comparison of Par3PKM on different data sets and (b) speedup comparison of Par3PKM, Par2PK-Means, and ParCLARA.
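Equation (9) in code, with hypothetical timings (not values from Table 2): a run taking 1536 s on one node and 240 s on eight nodes has speedup 6.4, against an ideal linear speedup of 8:

```python
def speedup(t_serial, t_parallel):
    """Equation (9): sequential time T_s over parallel time T_p."""
    return t_serial / t_parallel

print(speedup(1536, 240))  # → 6.4
```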

As depicted in Figure 13(a), the speedup of the Par3PKM algorithm increases relatively linearly with an increasing number of nodes. It is well known that linear speedup is difficult to achieve because of the communication cost and the skew of the slaves. Furthermore, Figure 13(b) shows that Par3PKM has better speedup than Par2PK-Means and ParCLARA. The results demonstrate that the parallel Par3PKM algorithm has a very good speedup performance, which is almost the same for data sets of very different sizes.

6.5. Evaluation on Scale-Up. To evaluate how well the Par3PKM algorithm processes larger data sets when more nodes are available, we perform scale-up experiments where we increase the size of the data sets (varying from 160 MB to 1280 MB) in direct proportion to the number of nodes (ranging from 1 node to 8 nodes); the results are plotted in Figure 14(a). Furthermore, with the Iris data sets (varying from 160 MB to 1280 MB), the scale-up comparison of Par3PKM, Par2PK-Means, and ParCLARA is depicted in Figure 14(b).

The scale-up metric [52, 53] is given by

Scale-up = T_s / T_ps, (10)

where T_s is the execution time of an algorithm for processing the given data sets on one node, and T_ps is the execution time of the algorithm for handling p-times larger data sets on p-times larger nodes.
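Equation (10) in code, again with hypothetical timings: 300 s for the base workload on one node versus 330 s for an 8-times larger data set on eight nodes gives a scale-up of about 0.91, close to the ideal value of 1:

```python
def scale_up(t_base, t_scaled):
    """Equation (10): time for the base data set on one node over the
    time for a p-times larger data set on p nodes; ~1 is ideal."""
    return t_base / t_scaled

print(round(scale_up(300, 330), 2))  # → 0.91
```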

As illustrated in Figure 14(a), the scale-up values of Par3PKM are in the vicinity of 1, or even below it, as both the number of nodes and the size of the data sets grow proportionally. Moreover, Figure 14(b) shows that Par3PKM has better scale-up than Par2PK-Means and ParCLARA. The results indicate that the Par3PKM algorithm has excellent scale-up and adaptability to large-scale data sets on a Hadoop cluster with MapReduce.

6.6. Evaluation on Reliability. To evaluate the reliability of the Par3PKM algorithm, we shut down several nodes (ranging from 1 node to 7 nodes) to test whether Par3PKM can still execute normally and achieve the same clustering results on the Iris data set (1280 MB); the results are plotted in Figure 12(b).

As illustrated in Figure 12(b), although the execution time of the Par3PKM algorithm increases gradually with the number of faulty nodes, the algorithm still executes normally and produces the same results. This shows that the Par3PKM algorithm has good reliability in a "Big Data" environment, owing to the high fault tolerance of the MapReduce framework on a Hadoop platform: when a node cannot execute its tasks on a Hadoop cluster, the JobTracker automatically reassigns the tasks of the faulty node(s) to other spare nodes. Conversely, the serial K-Means algorithm on a single machine cannot execute normally when the machine is faulty; the entire computational task fails.

In summary, extensive experiments are conducted on real and synthetic data, and the performance results demonstrate that the proposed Par3PKM algorithm is much more efficient and accurate, with better speedup, scale-up, and reliability.


Figure 14: Scale-up: (a) scale-up comparison of Par3PKM on different data sets and (b) scale-up comparison of Par3PKM, Par2PK-Means, and ParCLARA.

7. Conclusions

In this paper, we have proposed an efficient MapReduce-based parallel clustering algorithm, named Par3PKM, to solve the traffic subarea division problem with large-scale taxi trajectories. In Par3PKM, the distance metric and initialization strategy of K-Means are optimized in order to enhance the accuracy of clustering. Then, to improve the efficiency and scalability of Par3PKM, the optimized K-Means algorithm is implemented in a MapReduce framework on Hadoop. The optimization and parallelization of Par3PKM save memory consumption and reduce the computational cost of big calculations, thereby significantly improving the accuracy, efficiency, scalability, and reliability of traffic subarea division. Our performance evaluation indicates that the proposed algorithm can efficiently cluster a large number of GPS trajectories of taxicabs and, in particular, achieves more accurate results than K-Means, Par2PK-Means, and ParCLARA, with favorably superior performance. Furthermore, based on Par3PKM, we have presented a distributed traffic subarea division method, named DTSAD, which is performed on a Hadoop distributed computing platform with the MapReduce parallel processing paradigm. In DTSAD, the boundary identifying method can effectively connect the borders of the clustering results. Most importantly, we have divided the traffic subareas of Beijing using big real-world taxi trajectory data sets through the presented method, and our case study demonstrates that our approach can accurately and efficiently divide traffic subareas.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Authors' Contribution

Dawen Xia and Binfeng Wang contributed equally to this work.

Acknowledgments

The authors would like to thank the academic editor and the anonymous reviewers for their valuable comments and suggestions. This work was partially supported by the National Natural Science Foundation of China (Grant no. 61402380), the Scientific Project of the State Ethnic Affairs Commission of the People's Republic of China (Grant no. 14GZZ012), the Science and Technology Foundation of Guizhou (Grant no. LH20147386), and the Fundamental Research Funds for the Central Universities (Grants nos. XDJK2015B030 and XDJK2015D029).

References

[1] Y. Qi and S. Ishak, "Stochastic approach for short-term freeway traffic prediction during peak periods," IEEE Transactions on Intelligent Transportation Systems, vol. 14, no. 2, pp. 660–672, 2013.

[2] A. de Palma and R. Lindsey, "Traffic congestion pricing methodologies and technologies," Transportation Research Part C: Emerging Technologies, vol. 19, no. 6, pp. 1377–1399, 2011.

[3] V. Marx, "The big challenges of big data," Nature, vol. 498, no. 7453, pp. 255–260, 2013.

[4] D. Agrawal, P. Bernstein, E. Bertino et al., "Challenges and opportunities with big data: a community white paper developed by leading researchers across the United States," White Paper, 2012.

[5] "Special online collection: dealing with data," Science, vol. 331, no. 6018, pp. 639–806, 2011.

[6] "Big data: science in the petabyte era," Nature, vol. 455, no. 7209, pp. 1–136, 2008.

[7] R. R. Weiss and L. Zgorski, Obama Administration Unveils 'Big Data' Initiative: Announces $200 Million in New R&D Investments, Office of Science and Technology Policy, Executive Office of the President, 2012.

[8] J. Manyika, M. Chui, B. Brown et al., "Big data: the next frontier for innovation, competition, and productivity," Tech. Rep., McKinsey Global Institute, 2011.

[9] Y. Genovese and S. Prentice, "Pattern-based strategy: getting value from big data," Gartner Special Report, 2011.

[10] N. J. Yuan, Y. Zheng, L. Zhang, and X. Xie, "T-finder: a recommender system for finding passengers and vacant taxis," IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 10, pp. 2390–2403, 2013.

[11] J. Yuan, Y. Zheng, C. Zhang et al., "T-drive: driving directions based on taxi trajectories," in Proceedings of the 18th International Conference on Advances in Geographic Information Systems (ACM SIGSPATIAL GIS '10), pp. 99–108, ACM, San Jose, Calif, USA, November 2010.

[12] J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, Waltham, Mass, USA, 3rd edition, 2011.

[13] D. Pelleg and A. Moore, "X-means: extending K-means with efficient estimation of the number of clusters," in Proceedings of the 17th International Conference on Machine Learning (ICML '00), pp. 727–734, Stanford, Calif, USA, June 2000.

[14] S. Kantabutra and A. L. Couch, "Parallel K-means clustering algorithm on NOWs," NECTEC Technical Journal, vol. 1, no. 6, pp. 243–247, 2000.

[15] Y. Zhang, Z. Xiong, J. Mao, and L. Ou, "The study of parallel K-means algorithm," in Proceedings of the 6th World Congress on Intelligent Control and Automation (WCICA '06), pp. 5868–5871, IEEE, Dalian, China, June 2006.

[16] P. Kraj, A. Sharma, N. Garge, R. Podolsky, and R. A. McIndoe, "ParaKMeans: implementation of a parallelized K-means algorithm suitable for general laboratory use," BMC Bioinformatics, vol. 9, article 200, 13 pages, 2008.

[17] M. K. Pakhira, "Clustering large databases in distributed environment," in Proceedings of the IEEE International Advance Computing Conference (IACC '09), pp. 351–358, Patiala, India, March 2009.

[18] K. J. Kohlhoff, V. S. Pande, and R. B. Altman, "K-means for parallel architectures using all-prefix-sum sorting and updating steps," IEEE Transactions on Parallel and Distributed Systems, vol. 24, no. 8, pp. 1602–1612, 2013.

[19] C.-T. Chu, S. K. Kim, Y.-A. Lin et al., "Map-reduce for machine learning on multicore," in Proceedings of the 20th Annual Conference on Neural Information Processing Systems (NIPS '06), pp. 281–288, Vancouver, Canada, 2006.

[20] W. Zhao, H. Ma, and Q. He, "Parallel K-means clustering based on MapReduce," in Proceedings of the 1st International Conference on Cloud Computing (CloudCom '09), pp. 674–679, Beijing, China, December 2009.

[21] P. Zhou, J. Lei, and W. Ye, "Large-scale data sets clustering based on MapReduce and Hadoop," Journal of Computational Information Systems, vol. 7, no. 16, pp. 5956–5963, 2011.

[22] C. D. Nguyen, D. T. Nguyen, and V.-H. Pham, "Parallel two-phase K-means," in Proceedings of the 13th International Conference on Computational Science and Its Applications (ICCSA '13), pp. 24–27, Ho Chi Minh City, Vietnam, June 2013.

[23] R. J. Walinchus, "Real-time network decomposition and sub-network interfacing," Tech. Rep. HS-011 999, Highway Research Record, 1971.

[24] S. C. Wong, W. T. Wong, C. M. Leung, and C. O. Tong, "Group-based optimization of a time-dependent TRANSYT traffic model for area traffic control," Transportation Research Part B: Methodological, vol. 36, no. 4, pp. 291–312, 2002.

[25] D. I. Robertson and R. D. Bretherton, "Optimizing networks of traffic signals in real time: the SCOOT method," IEEE Transactions on Vehicular Technology, vol. 40, no. 1, pp. 11–15, 1991.

[26] Y.-Y. Ma and X.-G. Yang, "Traffic sub-area division expert system for urban traffic control," in Proceedings of the International Conference on Intelligent Computation Technology and Automation (ICICTA '08), pp. 589–593, Hunan, China, October 2008.

[27] K. Lu, J.-M. Xu, S.-J. Zheng, and S.-M. Wang, "Research on fast dynamic division method of coordinated control subarea," Acta Automatica Sinica, vol. 38, no. 2, pp. 279–287, 2012.

[28] K. Lu, J.-M. Xu, and S.-J. Zheng, "Correlation degree analysis of neighboring intersections and its application," Journal of South China University of Technology, vol. 37, no. 11, pp. 37–42, 2009.

[29] H. Guo, J. Cheng, Q. Peng, C. Zhu, and Y. Mu, "Dynamic division of traffic control sub-area methods based on the similarity of adjacent intersections," in Proceedings of the IEEE 17th International Conference on Intelligent Transportation Systems (ITSC '14), pp. 2208–2213, Qingdao, China, October 2014.

[30] C. Li, Y. Xie, H. Zhang, and X.-L. Yan, "Dynamic division about traffic control sub-area based on back propagation neural network," in Proceedings of the 2nd International Conference on Intelligent Human-Machine Systems and Cybernetics (IHMSC '10), pp. 22–25, Nanjing, China, August 2010.

[31] Z. Zhou, S. Lin, and Y. Xi, "A fast network partition method for large-scale urban traffic networks," Journal of Control Theory and Applications, vol. 11, no. 3, pp. 359–366, 2013.

[32] J. MacQueen, "Some methods for classification and analysis of multivariate observations," in Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, pp. 281–297, University of California Press, Berkeley, Calif, USA, 1967.

[33] A. K. Jain, "Data clustering: 50 years beyond K-means," Pattern Recognition Letters, vol. 31, no. 8, pp. 651–666, 2010.

[34] X. Wu, V. Kumar, J. R. Quinlan et al., "Top 10 algorithms in data mining," Knowledge and Information Systems, vol. 14, no. 1, pp. 1–37, 2008.

[35] W. Zou, Y. Zhu, H. Chen, and X. Sui, "A clustering approach using cooperative artificial bee colony algorithm," Discrete Dynamics in Nature and Society, vol. 2010, Article ID 459796, 16 pages, 2010.

[36] D. T. Pham, S. S. Dimov, and C. D. Nguyen, "An incremental K-means algorithm," Proceedings of the Institution of Mechanical Engineers, Part C: Journal of Mechanical Engineering Science, vol. 218, no. 7, pp. 783–794, 2004.

[37] D. T. Pham, S. S. Dimov, and C. D. Nguyen, "A two-phase K-means algorithm for large datasets," Journal of Mechanical Engineering Science, vol. 218, no. 10, pp. 1269–1273, 2004.

[38] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, "The Hadoop distributed file system," in Proceedings of the 26th Symposium on Mass Storage Systems and Technologies (MSST '10), pp. 1–10, IEEE, Incline Village, Nev, USA, May 2010.

[39] A. Ene, S. Im, and B. Moseley, "Fast clustering using MapReduce," in Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '11), pp. 681–689, San Diego, Calif, USA, August 2011.

[40] T. White, Hadoop: The Definitive Guide, O'Reilly Media, Sebastopol, Calif, USA, 3rd edition, 2012.

[41] D. Xia, Z. Rong, Y. Zhou, Y. Li, Y. Shen, and Z. Zhang, "A novel parallel algorithm for frequent itemsets mining in massive small files datasets," ICIC Express Letters, Part B: Applications, vol. 5, no. 2, pp. 459–466, 2014.

[42] D. Xia, Z. Rong, Y. Zhou, B. Wang, Y. Li, and Z. Zhang, "Discovery and analysis of usage data based on Hadoop for personalized information access," in Proceedings of the IEEE 16th International Conference on Computational Science and Engineering—Big Data Science and Engineering (CSE-BDSE '13), pp. 917–924, IEEE, Sydney, Australia, December 2013.

[43] D. Xia, B. Wang, Z. Rong, Y. Li, and Z. Zhang, "Effective methods and strategies for massive small files processing based on Hadoop," ICIC Express Letters, vol. 8, no. 7, pp. 1935–1941, 2014.

[44] S. Ghemawat, H. Gobioff, and S.-T. Leung, "The Google file system," in Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP '03), pp. 29–43, Bolton Landing, NY, USA, October 2003.

[45] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.

[46] P. Zikopoulos, C. Eaton, D. deRoos, T. Deutsch, and G. Lapis, Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data, McGraw-Hill, New York, NY, USA, 2011.

[47] W. K. D. Pun and A. B. M. S. Ali, "Unique distance measure approach for K-means (UDMA-Km) clustering algorithm," in Proceedings of the IEEE Region 10 Conference (TENCON '07), pp. 1–4, IEEE, Taipei, Taiwan, November 2007.

[48] A. M. Fahim, A. M. Salem, F. A. Torkey, and M. A. Ramadan, "An efficient enhanced K-means clustering algorithm," Journal of Zhejiang University SCIENCE A, vol. 7, no. 10, pp. 1626–1633, 2006.

[49] M. Zhu, Data Mining, University of Science and Technology of China Press, 2002.

[50] L. Kaufman and P. J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, John Wiley & Sons, 1990.

[51] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proceedings of the 26th Annual Conference on Neural Information Processing Systems (NIPS '12), pp. 1097–1105, Lake Tahoe, Nev, USA, December 2012.

[52] S. Englert, J. Gray, T. Kocher, and P. Shah, "A benchmark of NonStop SQL release 2 demonstrating near-linear speedup and scaleup on large databases," ACM SIGMETRICS Performance Evaluation Review, vol. 18, no. 1, pp. 245–246, 1990.

[53] X. Xu, J. Jäger, and H.-P. Kriegel, "A fast parallel clustering algorithm for large spatial databases," Data Mining and Knowledge Discovery, vol. 3, no. 3, pp. 263–290, 1999.


algorithm is described in detail. Section 5 presents the applications of our approach and then analyzes the division results. In Section 6, we evaluate the performance of the proposed algorithm and discuss the experimental results. Finally, Section 7 concludes the paper.

2. Related Work

In this section, we first briefly review related work on traffic subarea division and parallel K-Means algorithms and then give an overview of the MapReduce framework used in this paper.

2.1. Traffic Subarea Division. The concept of a traffic subarea was first proposed by Walinchus [23], and various methods of traffic subarea division were subsequently developed for various ITS applications, including easing traffic congestion. Wong et al. [24] presented a time-dependent TRANSYT (Traffic Network Study Tool) traffic model for area traffic control, and Robertson and Bretherton [25] introduced the SCOOT (Split Cycle Offset Optimization Technique) method to optimize networks of traffic signals in real time. Ma and Yang [26] designed an expert system of traffic subarea division, based on an integrated correlation index, to trade off several demands for managing the traffic network and reducing the complexity of traffic control. Lu et al. [27, 28] built a model of partitioning traffic control subareas based on correlation degree analysis and optimized the strategy of subarea division. Guo et al. [29] provided a dynamic traffic control subarea division method based on the similarity of adjacent intersections.

Furthermore, some researchers put forward dimension-reduced processing and genetic algorithms to optimize subarea division. Li et al. [30] proposed a method to divide traffic control subareas dynamically on the basis of a Back Propagation (BP) neural network; this method divides traffic subareas by considering the traffic flow distance of intersections and the cycle. In addition, to improve the performance of large-scale urban traffic network division, Zhou et al. [31] presented a new approach to calculating the correlation degree, which determines the desire for interconnection between two adjacent intersections.

Obviously, all the solutions mentioned above have many desirable properties but may have difficulty in processing a large number of taxi trajectories. In this work, we present a distributed traffic subarea division (DTSAD) method to improve the accuracy and efficiency of division using large-scale GPS trajectories of taxicabs on a Hadoop platform with MapReduce and ArcGIS.

2.2. Parallel K-Means Algorithms. The K-Means algorithm, proposed by MacQueen [32], is the most popular clustering method for partitioning a given data set into a user-specified number of clusters (i.e., K) [33–35]; it needs several iterations over the data set before converging on a solution and often converges only to a local optimum. To address this problem, an incremental K-Means (IKM) algorithm [36] was put forward to empirically reach the global optimum, but it has a higher complexity. Moreover, the traditional K-Means algorithm requires several scans over the data sets, which have to be fully loaded into memory in order to keep data access efficient, a requirement that is hard to fulfill when handling massive data [34]. To overcome this drawback, Two-Phase K-Means [37] was introduced; it can robustly find a good solution in one iteration by employing a buffering technique, but it is possibly unsuited to processing huge amounts of data.

Subsequently, Kantabutra and Couch [14] presented a parallel K-Means algorithm on a Message-Passing Interface (MPI) framework over a network of workstations (NOWs); its main idea is that all data objects of a cluster are stored in one slave node. Data rearrangement in this method needs large-scale data transmission between slaves, which makes it difficult to deal with big data sets. Zhang et al. [15] implemented parallel K-Means on a Parallel Virtual Machine (PVM) framework running on NOWs; this parallel method requires fully loading the data sets on the master node and synchronizing data at the end of each iteration. Kraj et al. [16] proposed a parallel ParaKMeans algorithm, which uses multithreading on a single machine to accomplish clustering and employs sufficient statistics to measure the quality of clusters in the stop condition. Pakhira [17] introduced a distributed K-Means algorithm, executed on multiprocessor machines by randomly delivering a split data subset to each processor. Kohlhoff et al. [18] developed parallel K-Means (Kps-Means) with a GPU implementation, which is efficient regardless of the dimensionality of the input data. Because the communication between different slaves takes up a large number of I/O resources and consumes huge amounts of time, these methods might have difficulty in dealing with large-scale data.

Moreover, to cluster big data effectively, some researchers implemented parallel versions of the K-Means algorithm on a MapReduce framework. Chu et al. [19] developed a pragmatic, general framework that applied the MapReduce parallel programming method to a variety of learning algorithms, including K-Means. Zhao et al. [20] put forward a PKMeans algorithm using MapReduce, and Zhou et al. [21] implemented automatic classification of a large number of documents with PKMeans. Nguyen et al. [22] introduced a parallel Two-Phase K-Means algorithm (Par2PK-Means) to overcome the limitations of available parallel versions; it achieves a better speedup than the sequential version.

To the best of our knowledge, none of the aforementioned efforts has been successfully applied to solving the traffic subarea division problem with large-scale taxi trajectories. In this work, we therefore propose a parallel clustering algorithm (Par3PKM) based on MapReduce for parallel clustering of a large number of GPS trajectories of taxicabs. Different from existing methods, in Par3PKM we modify the distance metric and initialization strategy of K-Means to improve the accuracy of clustering and implement the parallel computing of iteration in three phases to enhance the efficiency of computation. Furthermore, a Combiner function is employed to cut down the amount of data shuffled between the Map tasks and the Reduce tasks, thereby reducing the computational complexity of the MapReduce job


Figure 2: Architectures of HDFS and MapReduce. (Figure: master node with NameNode and JobTracker; slave nodes 1 to n, each with TaskTracker, DataNode, and database components.)

and saving the limited bandwidth available on a Hadoop cluster.

2.3. MapReduce Framework. The Hadoop Distributed File System (HDFS) [38] and Hadoop MapReduce [39, 40] are the two core components of Hadoop [41–43], based on the open-source implementations of GFS [44] and MapReduce [45], respectively; their architectures are depicted in Figure 2. For further details on Hadoop, see the Apache Hadoop website (http://hadoop.apache.org).

MapReduce is a parallel processing paradigm that allows for massive scalability across hundreds or thousands of servers in a Hadoop cluster [46], and it provides a particularly efficient computing framework for processing big taxi trajectory data for traffic subarea division. Typically, the processing of a MapReduce job includes a Map phase and a Reduce phase. Each phase has key-value pairs as input and output, the types of which may be chosen by the programmer, who specifies the Map function and the Reduce function [40]. A simple example is illustrated in Figure 3, which shows the logical data flow of a simple MapReduce job that calculates the maximum temperature for each year by mining huge amounts of weather data.
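The year/maximum-temperature example above can be sketched with plain Python functions standing in for the Map, shuffle, and Reduce steps. This is an illustrative in-process simulation of the data flow, not Hadoop's Java API; the record format is assumed to be simple "year,temperature" lines.

```python
from collections import defaultdict

def map_phase(records):
    """Map: emit a (year, temperature) pair for each raw weather line."""
    for line in records:
        year, temp = line.split(",")
        yield int(year), int(temp)

def shuffle(pairs):
    """Group values by key, as the MapReduce runtime does between the phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: keep the maximum temperature observed for each year."""
    return {year: max(temps) for year, temps in groups.items()}

records = ["2013,12", "2013,46", "2014,-11", "2014,48"]
result = reduce_phase(shuffle(map_phase(records)))
# result == {2013: 46, 2014: 48}
```

The same pipeline shape (map, group by key, reduce) carries over directly to the clustering job described below, with centroid indices playing the role of the year key.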

In this work, based on a MapReduce framework, the time-consuming iterations of the proposed Par3PKM algorithm are performed in three phases with the Map function, the Combiner function, and the Reduce function; the parallel computing process of MapReduce is shown in Figure 4. Specifically, in Par3PKM, the incremental Combiner function is executed between the Map tasks and the Reduce tasks, which can reduce the computational complexity of MapReduce jobs and save the limited bandwidth available on a Hadoop cluster.

3. Motivation

In this section, we describe the motivation of this work and give a reasonable solution to the traffic subarea division problem.

Naturally, GPS-equipped taxis are an essential public transportation tool in modern cities. Over the last decade, taxi trajectory data have been growing explosively and have become the most important source of mobile trajectory data, with the advantages of broad coverage, high precision, excellent continuity, few privacy concerns, and so forth. Thus, for solving the traffic subarea division problem, how to substantially improve the capacity for processing large-scale GPS trajectories of taxicabs poses an urgent challenge. On the other hand, as one of the most well-known clustering techniques in practice, the traditional K-Means algorithm is a simple iterative approach, but it has very high time complexity and memory consumption. The time requirements of K-Means are basically linear in the number of data objects: the time required is O(K * I * n * m), where K <= n and I <= n, K denotes the number of desired clusters, I is the number of iterations required for convergence, n is the total number of objects, and m represents the number of attributes in the given data sets. In addition, the storage required is O((K + n) * m). Obviously, the efficiency of the serial K-Means algorithm is low when handling a large number of taxi trajectories with the limited memory of a single machine. Hence, to enhance the accuracy and efficiency of clustering, another challenge is how to optimize K-Means and implement it in a MapReduce framework on a Hadoop platform.
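The O(K * I * n * m) cost structure is visible in a minimal serial K-Means loop; the sketch below is illustrative Python, not the paper's implementation. The inner distance step touches all n objects, K centroids, and m attributes on each of the I iterations.

```python
def kmeans_serial(points, centroids, max_iter=100):
    """Minimal serial K-Means; each iteration costs O(K * n * m)."""
    clusters = []
    for _ in range(max_iter):                      # I iterations
        clusters = [[] for _ in centroids]
        for p in points:                           # n objects
            # Squared Euclidean distance to each of the K centroids (m terms each).
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        new_centroids = [
            tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else c
            for cl, c in zip(clusters, centroids)
        ]
        if new_centroids == centroids:             # criterion function converged
            break
        centroids = new_centroids
    return centroids, clusters

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centroids, clusters = kmeans_serial(points, [(0.0, 0.0), (10.0, 10.0)])
# centroids == [(0.0, 0.5), (10.0, 10.5)]
```

Every quantity in the complexity bound appears as a loop in this sketch, which is why distributing the distance computations across nodes (Section 4) attacks the dominant term directly.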

These challenges motivate the development of the Par3PKM algorithm with the DTSAD method, and a new solution to the above problems is illustrated in Figure 5. As shown in Figure 5, based on a Hadoop platform with MapReduce and ArcGIS, the process of the DTSAD method mainly includes the following steps. First, we preprocess the correlation data extracted from large-scale taxi trajectories. Then, we cluster huge amounts of trajectory data in parallel using the proposed Par3PKM algorithm, as described in Section 4. Finally, we identify the borders of the clustering results to build each traffic subarea with our boundary identifying method, as depicted in Section 5.3.

Clearly, the key to this solution is the accuracy and efficiency of clustering large-scale taxi trajectories, which determines the overall performance of traffic subarea division. Thus, the aim of this paper is to put forward an efficient parallel clustering algorithm (Par3PKM) with a MapReduce implementation.

4. The Proposed Par3PKM Algorithm

In this section, the Parallel Three-Phase K-Means (Par3PKM) algorithm is proposed for efficiently and accurately clustering a large number of GPS trajectories of taxicabs under a MapReduce framework on Hadoop.

4.1. Overview and Notation. The process of the Par3PKM algorithm is depicted in Figure 6. Based on a MapReduce framework, the Par3PKM algorithm first chooses K of the


Figure 3: Logical data flow of a simple MapReduce job.

Figure 4: Parallel computing process of MapReduce. (Figure: input splits read by Map workers, partial results combined locally, then written by Reduce workers to the output files, coordinated by the master.)

Figure 5: Process and framework of the DTSAD method. (Figure: large-scale taxi trajectories pass through data preprocessing, parallel clustering, and boundary identifying to produce traffic subareas, on Hadoop with MapReduce and ArcGIS.)

objects as initial centroids, where K is the number of desired clusters. Then each of the remaining objects is assigned

Figure 6: Process of the Par3PKM algorithm. (Figure: select K and initialize centroids; then, in the parallel execution of iteration, compute the distance between each object and each cluster center, assign each object to its closest centroid, and recompute the centroid of each cluster until the criterion function converges.)

to the cluster to which it is most similar, on the basis of the distance between the object and the cluster center. Finally, it computes the new center for each cluster, and this


Figure 7: Parallel execution process of iteration. (Map function: compute the distance between each data object and each cluster center, select the centroid with the shortest distance, and output the intermediate data. Combiner function: merge data objects belonging to the same cluster center, calculate the sum of the values of data objects assigned to the same cluster, and output the local results of each cluster. Reduce function: aggregate the local results of all the clusters, compute the new cluster center for each cluster, update the centroids, and check whether the criterion function converges; output the final results on convergence.)

process iterates until the criterion function converges. The parallel execution of iteration is illustrated in Figure 7, and its MapReduce implementation will be described in detail in Section 4.3.

Typically, the squared-error criterion [33] is used to measure the quality of clustering. Let $X = \{x_i \mid i = 1, \ldots, n\}$ be the set of $n$ $m$-dimensional vectors to be clustered into a set of $K$ clusters $C = \{c_k \mid k = 1, \ldots, K\}$, and let $\mu_k$ be the mean of cluster $c_k$.

The squared-error criterion between $\mu_k$ and the given objects in cluster $c_k$ is defined as

$$J(c_k) = \sum_{x_i \in c_k} \left\| x_i - \mu_k \right\|^2. \quad (1)$$

The goal of the Par3PKM algorithm is to minimize the sum of the squared errors (SSE) over all the $K$ clusters, where SSE is given by

$$\mathrm{SSE} = J(C) = \sum_{k=1}^{K} \sum_{x_i \in c_k} \left\| x_i - \mu_k \right\|^2. \quad (2)$$
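Equations (1) and (2) transcribe directly into code. The sketch below is an illustrative Python rendering with hypothetical helper names, not part of the Par3PKM implementation:

```python
def cluster_error(cluster, mu):
    """Equation (1): squared error between a cluster's members and its mean."""
    return sum(
        sum((x_j - mu_j) ** 2 for x_j, mu_j in zip(x, mu))
        for x in cluster
    )

def sse(clusters, means):
    """Equation (2): sum of squared errors over all K clusters."""
    return sum(cluster_error(c, mu) for c, mu in zip(clusters, means))

clusters = [[(0.0, 0.0), (0.0, 1.0)], [(10.0, 10.0), (10.0, 11.0)]]
means = [(0.0, 0.5), (10.0, 10.5)]
# sse(clusters, means) == 1.0  (four points, each 0.5 away from its mean in one dimension)
```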

4.2. Distance Measure and Cluster Initialization. It is well known that the K-Means algorithm is sensitive to the distance measure and the cluster initialization. To overcome these critical limitations, we optimize the distance metric and the initialization strategy to improve the accuracy and efficiency of clustering in the proposed Par3PKM algorithm.

4.2.1. Distance Metric. The traditional K-Means algorithm typically employs the Euclidean metric to compute the distance between objects and cluster centers. In the proposed Par3PKM algorithm, we instead adopt two rules based on a statistical approach [47] for distance measure selection, which is more appropriate for large-scale trajectory data sets.

The two rules of distance measure selection are as follows:

(i) If $\kappa \le 8.46$, the squared Euclidean distance (see (5)) is chosen as the distance measure of Par3PKM.

(ii) If $\kappa > 8.46$, the Manhattan distance (see (6)) is selected as the distance measure of Par3PKM.

Here, $\kappa$ is the kurtosis, which measures the tail heaviness. It is defined as

$$\kappa = \frac{(1/n) \sum_{i=1}^{n} \left( x_i - \bar{X} \right)^4}{\sigma^4}, \quad (3)$$

where $n$ represents the sequence length, and the sample mean $\bar{X}$ and sample standard deviation $\sigma$ are given by

$$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} x_i, \qquad \sigma = \left( \frac{1}{n-1} \sum_{i=1}^{n} \left( x_i - \bar{X} \right)^2 \right)^{1/2}. \quad (4)$$

The squared Euclidean distance $d_e^2$ and the Manhattan distance $d_m$ are, respectively, given by

$$d_e^2 (x_i, \mu_k) = \sum_{j=1}^{m} \left| x_{ij} - \mu_{kj} \right|^2, \quad (5)$$

$$d_m (x_i, \mu_k) = \sum_{j=1}^{m} \left| x_{ij} - \mu_{kj} \right|, \quad (6)$$

where $x_i = (x_{i1}, x_{i2}, \ldots, x_{im})$ and $\mu_k = (\mu_{k1}, \mu_{k2}, \ldots, \mu_{km})$ are two $m$-dimensional data objects.
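The kurtosis-based rule can be sketched as follows, an illustrative Python rendering of (3)-(6) with the 8.46 threshold from the rules above; the function names are hypothetical, and the sketch uses the sample standard deviation of (4) via the standard library:

```python
import statistics

def kurtosis(xs):
    """Equation (3): kurtosis, using the (n-1)-denominator std of equation (4)."""
    n = len(xs)
    mean = sum(xs) / n
    sigma = statistics.stdev(xs)  # sqrt of (1/(n-1)) * sum (x - mean)^2
    return sum((x - mean) ** 4 for x in xs) / (n * sigma ** 4)

def pick_metric(xs, threshold=8.46):
    """Rules (i)/(ii): squared Euclidean for light tails, Manhattan for heavy."""
    if kurtosis(xs) <= threshold:
        return lambda x, mu: sum((a - b) ** 2 for a, b in zip(x, mu))  # eq. (5)
    return lambda x, mu: sum(abs(a - b) for a, b in zip(x, mu))        # eq. (6)

d = pick_metric([1.0, 2.0, 3.0, 4.0, 5.0])  # low kurtosis -> squared Euclidean
# d((0.0, 0.0), (1.0, 2.0)) == 5.0
```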

The Par3PKM algorithm achieves more accurate clustering than the K-Means algorithm via this statistics-based distance measure method, which will be evaluated in Section 6.

4.2.2. Initialization Strategy. Different initialization strategies lead to different efficiencies. In our Par3PKM, three initialization strategies are developed to enhance the efficiency of clustering. To overcome the local minima of Par3PKM, for a given K we try several different initial partitions and select the partition with the smallest value of the squared error. Furthermore, inspired by [48], we take the mutually farthest data objects in the high-density area as the initial centroids for Par3PKM; the method of obtaining the initial centroids from the high-density area was introduced in [49]. In addition, both pre- and postprocessing are taken into consideration in Par3PKM: for example, we remove outliers in a preprocessing step, and we eliminate small clusters and/or merge close clusters into a large cluster when postprocessing the results.
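One plausible reading of the high-density initialization is sketched below. This is an assumption-laden illustration, not the exact procedure of [48, 49]: local density is estimated by counting neighbors within a radius, low-density outliers are dropped, and the remaining centroids are chosen greedily to be mutually far apart. The function name, radius parameter, and density cutoff are all hypothetical choices.

```python
def init_centroids(points, k, radius=2.0):
    """Choose K initial centroids: high-density objects that are mutually far apart."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Local density: how many other objects lie within `radius` of each point.
    density = {p: sum(dist2(p, q) <= radius ** 2 for q in points if q != p)
               for p in points}
    avg = sum(density.values()) / len(points)
    dense = [p for p in points if density[p] >= avg]  # drop low-density outliers

    centroids = [dense[0]]
    while len(centroids) < k:
        # Greedily add the dense object farthest from the already-chosen centroids.
        centroids.append(max(dense, key=lambda p: min(dist2(p, c) for c in centroids)))
    return centroids

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
seeds = init_centroids(pts, k=2)
# the outlier (50, 50) is never selected; one seed comes from each dense region
```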

From the experimental evaluations described in Section 6, we can observe that the implemented initialization strategies shorten the execution time of Par3PKM while producing the same clustering results.

4.3. MapReduce Implementation. To improve the efficiency and scalability of clustering, we implement the Par3PKM algorithm in the MapReduce model of computation. The tasks of Par3PKM mainly consist of the following three aspects:

(i) Compute the distance between each object and each cluster center.

(ii) Assign each object to its closest centroid.

(iii) Recompute the new cluster center for each cluster.

In this work, we accomplish the aforementioned tasks on a Hadoop platform using MapReduce. First, we select K data objects (vectors) as the initial cluster centers from the input data sets, which are stored in HDFS. Then, we update the centroid of each cluster after each iteration. As depicted in Figure 7, the parallel execution of iteration is implemented by the Map function, the Combiner function, and the Reduce function in three phases (i.e., the Map phase, the Combine phase, and the Reduce phase), respectively.

4.3.1. Map Phase. In this phase, a Map task receives each line of the sequence file as a different key-value pair, which forms the input to the Map function. The Map function first computes the distance between each data object and each cluster center, then assigns each object to its closest centroid according to the shortest distance, and finally outputs the intermediate data to the Combiner function. The Map function is formally depicted in Algorithm 1.

4.3.2. Combine Phase. In this phase, the Combiner function first extracts all the data objects from value1, which is the output of the Map function, and merges the data objects belonging to the same cluster center. Next, it calculates the sum of the values of the data objects assigned to the same cluster and records the number of samples in that cluster, so as to compute the mean value of the data objects. Finally, it outputs the local results of each cluster to the Reduce function. The Combiner function is formally described in Algorithm 2.

4.3.3. Reduce Phase. In this phase, the Reduce function extracts all the data objects from value2, which is the output of the Combiner function, and aggregates the local results of all the clusters. Then it computes the new cluster center for each cluster. After that, it judges whether the criterion function converges. Finally, it outputs the final results if the convergence test is true and executes the next iteration otherwise. The Reduce function is formally illustrated in Algorithm 3.
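One iteration of the three phases can be simulated in-process. The sketch below uses Python stand-ins for Algorithms 1-3 rather than Hadoop itself: each split is combined locally before "shuffling", mimicking how the Combiner shrinks the data sent to the Reduce phase to one (sum, count) pair per cluster per split.

```python
from collections import defaultdict

def map_phase(split, centroids):
    """Algorithm 1: emit (closest-centroid index, point) for each object."""
    for p in split:
        dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
        yield dists.index(min(dists)), p

def combine_phase(mapped):
    """Algorithm 2: per-split partial sums -> {index: (sum_vector, count)}."""
    acc = defaultdict(lambda: [None, 0])
    for idx, p in mapped:
        s, n = acc[idx]
        acc[idx] = [p if s is None else tuple(a + b for a, b in zip(s, p)), n + 1]
    return {i: (tuple(s), n) for i, (s, n) in acc.items()}

def reduce_phase(partials):
    """Algorithm 3: aggregate partial sums and emit the new centroids."""
    total = defaultdict(lambda: [None, 0])
    for part in partials:
        for idx, (s, n) in part.items():
            t, m = total[idx]
            total[idx] = [s if t is None else tuple(a + b for a, b in zip(t, s)), m + n]
    return {i: tuple(v / n for v in s) for i, (s, n) in total.items()}

splits = [[(0.0, 0.0), (0.0, 1.0)], [(10.0, 10.0), (10.0, 11.0)]]
centroids = [(0.0, 0.0), (10.0, 10.0)]
partials = [combine_phase(map_phase(s, centroids)) for s in splits]
new_centroids = reduce_phase(partials)
# new_centroids == {0: (0.0, 0.5), 1: (10.0, 10.5)}
```

Without the combine step, every (index, point) pair would cross the network; with it, only one partial sum per cluster per split does, which is the bandwidth saving discussed in Section 2.3.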

4.4. Complexity Analysis. MapReduce is a programming model for large-scale data processing, and MapReduce programs are inherently parallel. Thus, the Par3PKM algorithm with MapReduce implementation distributes large numbers of computational tasks across different nodes. Under this parallel processing paradigm, the time complexity of Par3PKM is O(K * I * n * m)/(p * q), where p is the number of nodes, q is the number of Map tasks in one node, and K, I, n, and m are explained in Section 3. Moreover, the space complexity of Par3PKM is O((K + n) * m)/p, since only the data objects and the centroids are stored in each slave node; thus, the space requirements of Par3PKM are modest.

In comparison with the traditional K-Means algorithm, the Par3PKM algorithm improves the efficiency of clustering by the following rate:

$$O_{\mathrm{imp}} = \left( 1 - \frac{1}{p q} \right) \times 100\%. \quad (7)$$

Based on the above analyses, we conclude that the Par3PKM algorithm is relatively scalable and efficient for clustering large-scale data sets, with the desired computational complexity.
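Equation (7) is a one-liner; the figures below are an illustrative calculation, not a measured result from the paper's experiments. With 8 worker nodes each running 4 Map tasks, the serial cost term shrinks by a factor of p * q = 32:

```python
def efficiency_gain(p, q):
    """Equation (7): percentage improvement over serial K-Means for p nodes, q Map tasks each."""
    return (1 - 1 / (p * q)) * 100

# e.g., 8 worker nodes, each running 4 Map tasks:
gain = efficiency_gain(8, 4)
# gain == 96.875
```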

5. Case Study

In this section, we apply the proposed approach to divide the traffic subareas of Beijing using large-scale taxi trajectories and then analyze the results.

5.1. Data Sets and Framework. In this work, we divide the traffic subareas of Beijing based on real-world trajectory data sets (http://www.datatang.com), which contain a large number of GPS trajectories recorded by 12,000 taxicabs during a period of 30 days in November 2012. The total distance of the data sets is more than 50 million kilometers, and the total size is 50 GB. In particular, the total number of GPS points reaches 969 million.

Additionally, we perform the case study on a Hadoop cluster with MapReduce (as described in Section 6.1) and ArcGIS, using the road network of Beijing, which consists of 106,579 road nodes and 141,380 road segments.


Input:
  key: the offset; value: the sample; centroids: the global variable
Output: <key1, value1> pairs, where key1 is the index of the closest centroid and value1 is the information of the sample

(1) Construct a global variable centroids including the information of the closest centroid point
(2) Construct the sample examples to extract the data objects from value
(3) min_Dis = Double.MAX_VALUE
(4) index = -1
(5) for i = 0 to centroids.length do
(6)   distance = DistanceFunction(examples, centroids[i])
(7)   if distance < min_Dis then
(8)     min_Dis = distance
(9)     index = i
(10)  end if
(11) end for
(12) key1 = index
(13) Construct value1 as a string consisting of the values of the different dimensions
(14) return <key1, value1> pairs

Algorithm 1: Map(key, value).

Input:
  key1: the index of the cluster; med_i: the list of the samples assigned to the same cluster
Output: <key2, value2> pairs, where key2 is the index of the cluster and value2 is the sum of the values of the samples belonging to the same cluster, together with the number of samples

(1) Construct a counter num_s to record the number of samples in the same cluster
(2) Construct an array sum_v to record the sum of the values of the different dimensions of the samples belonging to the same cluster (i.e., the samples in the list med_i)
(3) Construct the sample examples to extract the data objects from med_i.next(), and the dimensions to obtain the dimension of the original data objects
(4) num_s = 0
(5) while (med_i.hasNext()) do
(6)   CurrentPoint = med_i.next()
(7)   num_s++
(8)   for i = 0 to dimensions do
(9)     sum_v[i] += CurrentPoint.point[i]  // accumulate the values of each dimension
(10)  end for
(11) end while
(12) for i = 0 to dimensions do
(13)   mean[i] = sum_v[i] / num_s  // the mean value of the samples in the cluster
(14) end for
(15) key2 = index
(16) Construct value2 as a string containing the sum of the values of each dimension sum_v[i] and the number of samples num_s
(17) return <key2, value2> pairs

Algorithm 2: Combiner(key1, med_i).


Input:
  key2: the index of the cluster; med_i: the list of the local sums from different clusters
Output: <key3, value3> pairs, where key3 is the index of the cluster and value3 is the new cluster center

(1) Construct a counter Num to record the total number of samples belonging to the same cluster
(2) Construct an array sum_v to record the sum of the values of the different dimensions of the samples in the same cluster (i.e., the samples in the list med_i)
(3) Construct the sample examples to extract the data objects from med_i.next(), and the dimensions to obtain the dimension of the original data objects
(4) Num = 0
(5) while (med_i.hasNext()) do
(6)   CurrentPoint = med_i.next()
(7)   Num += num_s
(8)   for i = 0 to dimensions do
(9)     sum_v[i] += CurrentPoint.point[i]
(10)  end for
(11) end while
(12) for i = 0 to dimensions do
(13)   mean[i] = sum_v[i] / Num  // the new cluster center
(14) end for
(15) key3 = index
(16) Construct value3 as a string composed of the new cluster center
(17) return <key3, value3> pairs

Algorithm 3: Reduce(key2, med_i).

Figure 8: Clustering results of large-scale taxi trajectories. (Figure: map of Beijing with each cluster in a different color; Area A is highlighted.)

5.2. Parallel Clustering. After data preprocessing with the DTSAD method, we extract the relevant attributes (e.g., longitude and latitude) of the GPS trajectory records where passengers pick up or drop off taxis from the aforementioned data sets. Then, on a Hadoop cluster with MapReduce, we cluster the extracted trajectory data sets with the Par3PKM algorithm (as depicted in Section 4), setting K = 100 as the number of desired clusters. Finally, based on the ArcGIS platform with the road network of Beijing, we plot the results in Figure 8.

As illustrated in Figure 8, large-scale taxi trajectories are clustered into different areas, and each area with


Figure 9: Process of the boundary identifying method: (a) division of the coordinate system into 360°/n sectors, (b) selection of border points (e.g., P), and (c) connection of the border points.

a different color represents a cluster. Each cluster (e.g., Area A) has obvious characteristics of traffic condition: for example, the flows of people and automobiles are high in these areas, in agreement with the real traffic map and traffic conditions of Beijing.

5.3. Boundary Identifying. On the ArcGIS platform, it is difficult to identify the borders of each cluster directly; however, the borders of each cluster must be connected in order to accurately form a traffic subarea. We therefore use our boundary identifying method, which is described in Figure 9 and is mainly composed of the following three steps.

Step 1. As shown in Figure 9(a), we build a coordinate system that is equally divided into n parts, taking (0, 0) as the origin of coordinates.

Step 2. We match each cluster center to the origin of coordinates and then map the other points of the same cluster into the coordinate system. Finally, the farthest point of each part is selected (e.g., P in Figure 9(b)).

Step 3. As depicted in Figure 9(c), we connect the selected points of all parts and thereby obtain a subarea.
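The three steps above can be sketched as follows. This is an illustrative Python rendering with a hypothetical function name: the cluster center is treated as the origin, each member point is binned into one of n equal angular sectors, the farthest member per sector is kept, and the kept points are connected in angular order to close the polygon.

```python
import math

def boundary(center, members, n_sectors=36):
    """Pick the farthest member in each of n equal angular sectors around the center."""
    cx, cy = center
    farthest = {}
    for x, y in members:
        dx, dy = x - cx, y - cy
        # Step 1: sector index in [0, n_sectors) from the angle around the center.
        sector = int((math.atan2(dy, dx) % (2 * math.pi)) / (2 * math.pi / n_sectors))
        # Step 2: keep only the farthest point seen in this sector.
        r = math.hypot(dx, dy)
        if sector not in farthest or r > farthest[sector][0]:
            farthest[sector] = (r, (x, y))
    # Step 3: connect the selected points in angular order to close the polygon.
    return [pt for _, (_, pt) in sorted(farthest.items())]

pts = [(1, 1), (3, 3), (-1, 1), (-2, -2), (1, -1)]
poly = boundary((0, 0), pts, n_sectors=4)
# poly == [(3, 3), (-1, 1), (-2, -2), (1, -1)]
```

Note that (1, 1) is dropped because (3, 3) lies farther out in the same sector, matching the selection of P in Figure 9(b).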

Using the presented boundary identifying method on the clustering results of Par3PKM, we plot the division results of the traffic subareas in Figure 10. As described in Figure 10, Area A is a typical traffic subarea, shown in the lower right corner of the graph, which includes Tsinghua University, Peking University, Renmin University of China, Beihang University, and Zhongguancun. Area B is composed of the Beijing Workers' Gymnasium, Blue Zoo Beijing, Children Outdoor Paopao Paradise, and so forth. Area C consists of Beijing North Railway Station, the Beijing Exhibition Center, the Capital Indoor Stadium, and so forth. Area D contains the Olympic Sports Center Stadium, Beijing Olympic Park, the National Indoor Stadium, the Beijing National Aquatics Center, the Beijing International Convention Center, and so forth.

5.4. Analysis of Results. According to the division results, we can observe that some areas with similar traffic conditions are divided into the same traffic subarea, such as Tsinghua University and Peking University, or the Olympic Sports Center Stadium and the National Indoor Stadium. In contrast, Blue Zoo Beijing and the Beijing Exhibition Center, as well as Beijing North Railway Station and the Beijing International Convention Center, are classified into different traffic subareas. This is because the different regions of a traffic subarea have great similarities and correlations in traffic condition, business pattern, and other aspects.

Based on the above analysis, we conclude that the proposed Par3PKM algorithm can efficiently cluster big trajectory data on a Hadoop cluster using MapReduce. Moreover, our boundary identifying method can accurately connect the borders of the clustering results for each cluster. In particular, the division results are consistent with the real traffic conditions of the corresponding areas in Beijing. Overall, the results demonstrate that Par3PKM combined with DTSAD is a promising alternative for traffic subarea division with large-scale taxi trajectories and thus can reduce the complexity of traffic planning, management, and analysis. More importantly, it can provide helpful decision-making support for building ITSs.

6 Evaluation and Discussion

In this section, we evaluate the accuracy, efficiency, speedup, scale-up, and reliability of the Par3PKM algorithm via extensive experiments on real and synthetic data and then discuss the performance results.

6.1. Evaluation Setup. The experimental platform is based on a Hadoop cluster, which is composed of one master machine and eight slave machines, each with an Intel Xeon E7-4820 2.00 GHz CPU (4-core) and 8.00 GB RAM. All the experiments are performed on Ubuntu 12.04 OS with Hadoop 1.0.4 and JDK 1.6.0.

Discrete Dynamics in Nature and Society 11


Figure 10: Division results of traffic subarea.

Table 1: Data sets of the experimental evaluations.

Name                  Number of instances   Number of attributes
Iris                  150                   4
Haberman's Survival   306                   3
Ecoli                 336                   8
Hayes-Roth            160                   5
Lenses                24                    4
Wine                  178                   13

In addition to a real taxi trajectory data set (as described in Section 5.1), we use six synthetic data sets (as shown in Table 1) selected from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets.html) to evaluate the performance of the Par3PKM algorithm in comparison with K-Means, Par2PK-Means, and ParCLARA, the latter being the best-known K-medoids algorithm CLARA (Clustering Large Applications) [50] with a MapReduce implementation. Meanwhile, each data set is processed into 80 MB, 160 MB, 320 MB, 640 MB, 1280 MB, and 2560 MB versions so as to further verify the efficiency of the proposed algorithm. Also, we process the seven data sets into 160 MB, 320 MB, 480 MB, 640 MB, 800 MB, 960 MB, 1120 MB, and 1280 MB versions for validating the scale-up of our algorithm.

6.2. Evaluation on Efficiency. We perform efficiency experiments in which we execute Par3PKM, Par2PK-Means, and ParCLARA in the parallel environment with eight nodes, and K-Means in the single-machine environment, using seven data sets with different sizes (varying from 80 MB to 2560 MB), in order to demonstrate whether the Par3PKM algorithm can process larger data sets and has higher efficiency. The experimental results are shown in Table 2 and Figure 11, respectively.

As depicted in Figure 11, the K-Means algorithm cannot process data sets over 1280 MB in the single-machine environment on account of memory overflow, and thus the graph does not present the corresponding execution time of K-Means in this experiment. However, the Par3PKM algorithm can effectively handle data sets of more than 1280 MB, and even larger data, in the parallel environment (i.e., in a "Big Data" environment). In particular, the Par3PKM algorithm has higher efficiency than the ParCLARA algorithm and the Par2PK-Means algorithm, owing to the improvement and the parallelism of the K-Means algorithm, such as the addition of the Combiner function in the Combine phase. To reduce the computational complexity of a MapReduce job and save the limited bandwidth available on a Hadoop cluster, the Combiner function of the Par3PKM algorithm is employed to cut down the amount of data shuffled between the Map tasks and the Reduce tasks; it is specified to run on the Map output, and its output forms the input to the Reduce function.
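The role of the Combiner can be illustrated with a simplified single-process sketch (an illustration of the idea, not the authors' Hadoop code): for K-Means, a combiner collapses each Map task's output into per-cluster (sum, count) pairs, so at most K small records per task are shuffled instead of one record per point:

```python
def map_phase(points, centroids):
    """Map: emit (nearest-centroid index, point) for every input point."""
    for p in points:
        idx = min(range(len(centroids)),
                  key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
        yield idx, p

def combine_phase(mapped):
    """Combine: collapse one Map task's output to per-cluster (sum, count)."""
    partial = {}
    for idx, p in mapped:
        s, c = partial.get(idx, ([0.0] * len(p), 0))
        partial[idx] = ([a + b for a, b in zip(s, p)], c + 1)
    return partial  # at most K records leave the node, not one per point

def reduce_phase(partials):
    """Reduce: merge partial sums from all Map tasks into new centroids."""
    total = {}
    for part in partials:
        for idx, (s, c) in part.items():
            ts, tc = total.get(idx, ([0.0] * len(s), 0))
            total[idx] = ([a + b for a, b in zip(ts, s)], tc + c)
    return {idx: tuple(a / c for a in s) for idx, (s, c) in total.items()}
```

Because addition is associative, combining partial sums locally produces exactly the same new centroids as shuffling every point, while shrinking the intermediate data from O(n) to O(K) records per Map task.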

At the same time, we can find that the execution time of the K-Means algorithm is shorter than that of the Par3PKM algorithm in clustering small-scale data sets. The reason is that the communication and interaction of each node consume a certain amount of time in the parallel environment (e.g., the start of Job and Task tasks, the communication between the NameNode and the DataNodes), thereby making the execution time much longer than the actual computation time of the Par3PKM algorithm. More importantly, we can also observe that the efficiency advantage of the Par3PKM algorithm multiplies, and its superiority becomes more marked, as the sizes of the data sets gradually increase.


Table 2: Execution time comparison on seven data sets.

Data set              Size (MB)   Execution time (s)
                                  K-Means   ParCLARA   Par2PK-Means   Par3PKM
Taxi Trajectory       80          75        662        439            312
                      160         103       756        486            371
                      320         195       893        579            406
                      640         1125      1173       838            614
                      1280        2373      1679       1348           1026
                      2560        —         2721       2312           1879
Iris                  80          32        612        301            213
                      160         54        693        380            279
                      320         116       812        440            306
                      640         685       1045       630            403
                      1280        1800      1248       1005           768
                      2560        —         2463       2013           1420
Haberman's Survival   80          56        576        311            296
                      160         60        675        400            324
                      320         130       823        470            378
                      640         720       987        670            426
                      1280        2010      1321       1200           873
                      2560        —         2449       2200           1719
Ecoli                 80          38        628        330            283
                      160         66        712        400            324
                      320         130       835        460            375
                      640         756       1104       700            482
                      1280        1912      1636       1234           763
                      2560        —         2479       2312           1416
Hayes-Roth            80          41        568        310            278
                      160         57        643        395            347
                      320         125       726        460            387
                      640         715       973        660            438
                      1280        1980      1479       1211           736
                      2560        —         2423       2120           1567
Lenses                80          32        624        309            297
                      160         59        701        389            327
                      320         130       924        452            376
                      640         700       1072       545            432
                      1280        1895      1378       1085           814
                      2560        —         2379       2089           1473
Wine                  80          50        635        350            317
                      160         78        705        420            356
                      320         130       835        470            402
                      640         730       1006       610            426
                      1280        2100      1346       1245           843
                      2560        —         2463       2240           1645

6.3. Evaluation on Accuracy. To evaluate the accuracy, we cluster different data sets via Par3PKM, Par2PK-Means, and ParCLARA based on a Hadoop cluster with eight nodes, and via K-Means in the single-machine environment, respectively. We then plot the results in Figure 12(a).

The quality of the algorithm is evaluated via the following error rate (ER) [51] equation:

ER = (O_m / O_t) × 100, (8)


Figure 11: Efficiency comparison on different data sets: (a) Taxi Trajectory, (b) Iris, (c) Haberman's Survival, (d) Ecoli, (e) Hayes-Roth, (f) Lenses, and (g) Wine.


Figure 12: Accuracy and reliability: (a) accuracy comparison of Par3PKM, Par2PK-Means, and ParCLARA on different data sets, and (b) reliability of Par3PKM.

where O_m is the number of misclassified objects and O_t is the total number of objects. The lower the ER, the better the clustering.
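In code, (8) amounts to the following one-liner (a sketch; it assumes the matching between cluster labels and class labels has already been resolved):

```python
def error_rate(predicted, actual):
    """ER = (misclassified objects / total objects) * 100, as in (8)."""
    misclassified = sum(1 for p, a in zip(predicted, actual) if p != a)
    return misclassified / len(actual) * 100
```

For example, one wrong label out of four objects gives an ER of 25%.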

As illustrated in Figure 12(a), in comparison with the other algorithms, the Par3PKM algorithm produces more accurate clustering results in most cases. The results indicate that the Par3PKM algorithm is valid and feasible.

6.4. Evaluation on Speedup. In order to evaluate the speedup of the Par3PKM algorithm, we keep the size of each of the seven data sets constant (1280 MB) and increase the number of nodes (ranging from 1 node to 8 nodes) on a Hadoop cluster, and then plot the results in Figure 13(a). Moreover, we utilize the Iris data set (1280 MB) to further verify the speedup of Par3PKM in comparison with Par2PK-Means and ParCLARA, and the results are illustrated in Figure 13(b).

The speedup metric [52, 53] is defined as

Speedup = T_s / T_p, (9)

where T_s represents the execution time of an algorithm on one node (i.e., the sequential execution time) for clustering


Figure 13: Speedup: (a) speedup comparison of Par3PKM on different data sets and (b) speedup comparison of Par3PKM, Par2PK-Means, and ParCLARA.

objects using the given data sets, and T_p denotes the execution time of an algorithm for solving the same problem using the same data sets on a Hadoop cluster with p nodes (i.e., the parallel execution time), respectively.

As depicted in Figure 13(a), the speedup of the Par3PKM algorithm increases relatively linearly with an increasing number of nodes. It is known that linear speedup is difficult to achieve because of the communication cost and the skew of the slaves. Furthermore, Figure 13(b) shows that Par3PKM has better speedup than Par2PK-Means and ParCLARA. The results demonstrate that the parallel Par3PKM algorithm has a very good speedup performance, which is almost the same for data sets with very different sizes.
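As a numeric sketch of (9), speedup and the corresponding parallel efficiency can be computed directly from measured execution times (the example values below are illustrative, not measurements from the paper):

```python
def speedup(t_serial, t_parallel):
    """Speedup = T_s / T_p, as defined in (9)."""
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, nodes):
    """Parallel efficiency: speedup divided by the number of nodes."""
    return speedup(t_serial, t_parallel) / nodes
```

For instance, a job taking 1600 s on one node and 400 s on eight nodes has a speedup of 4 and a parallel efficiency of 0.5, i.e., the sublinear speedup discussed above.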

6.5. Evaluation on Scale-Up. To evaluate how well the Par3PKM algorithm processes larger data sets when more nodes are available, we perform scale-up experiments in which we increase the size of the data sets (varying from 160 MB to 1280 MB) in direct proportion to the number of nodes (ranging from 1 node to 8 nodes), and then plot the results in Figure 14(a). Furthermore, with the Iris data sets (varying from 160 MB to 1280 MB), the scale-up comparison of Par3PKM, Par2PK-Means, and ParCLARA is depicted in Figure 14(b).

The scale-up metric [52, 53] is given by

Scale-up = T_s / T_sp, (10)

where T_s is the execution time of an algorithm for processing the given data sets on one node, and T_sp is the execution time of the algorithm for handling p-times larger data sets on p-times larger nodes.

As illustrated in Figure 14(a), the scale-up values of Par3PKM are in the vicinity of 1, or even less, with the proportional growth of both the number of nodes and the size of the data sets. Moreover, Figure 14(b) shows that Par3PKM has better scale-up than Par2PK-Means and ParCLARA. The results indicate that the Par3PKM algorithm has excellent scale-up and adaptability on large-scale data sets based on a Hadoop cluster with MapReduce.

6.6. Evaluation on Reliability. To evaluate the reliability of the Par3PKM algorithm, we shut down several nodes (ranging from 1 node to 7 nodes) to demonstrate whether Par3PKM can still execute normally and achieve the same clustering results on the Iris data set of 1280 MB, and then plot the results in Figure 12(b).

As illustrated in Figure 12(b), although the execution time of the Par3PKM algorithm increases gradually with the growth of the number of faulty nodes, the Par3PKM algorithm still executes normally and produces the same results. The results show that the Par3PKM algorithm has good reliability in a "Big Data" environment, due to the high fault tolerance of the MapReduce framework on a Hadoop platform: when a node cannot execute tasks on a Hadoop cluster, the JobTracker automatically assigns the tasks of the faulty node(s) to other spare nodes. Conversely, the serial K-Means algorithm on a single machine cannot execute its tasks when the machine is faulty, and the entire computational task fails.

In summary, extensive experiments are conducted on real and synthetic data, and the performance results demonstrate that the proposed Par3PKM algorithm is much more efficient and accurate, with better speedup, scale-up, and reliability.


Figure 14: Scale-up: (a) scale-up comparison of Par3PKM on different data sets and (b) scale-up comparison of Par3PKM, Par2PK-Means, and ParCLARA.

7. Conclusions

In this paper, we have proposed an efficient MapReduce-based parallel clustering algorithm, named Par3PKM, to solve the traffic subarea division problem with large-scale taxi trajectories. In Par3PKM, the distance metric and initialization strategy of K-Means are optimized in order to enhance the accuracy of clustering. Then, to improve the efficiency and scalability of Par3PKM, the optimized K-Means algorithm is implemented in a MapReduce framework on Hadoop. The optimization and parallelism of Par3PKM save memory consumption and reduce the computational cost of big calculations, thereby significantly improving the accuracy, efficiency, scalability, and reliability of traffic subarea division. Our performance evaluation indicates that the proposed algorithm can efficiently cluster a large number of GPS trajectories of taxicabs and, in particular, achieves more accurate results than K-Means, Par2PK-Means, and ParCLARA, with favorably superior performance. Furthermore, based on Par3PKM, we have presented a distributed traffic subarea division method, named DTSAD, which is performed on a Hadoop distributed computing platform with the MapReduce parallel processing paradigm. In DTSAD, the boundary identifying method can effectively connect the borders of the clustering results. Most importantly, we have divided the traffic subareas of Beijing using big real-world taxi trajectory data sets through the presented method, and our case study demonstrates that our approach can accurately and efficiently divide traffic subareas.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Authors' Contribution

Dawen Xia and Binfeng Wang contributed equally to this work.

Acknowledgments

The authors would like to thank the academic editor and the anonymous reviewers for their valuable comments and suggestions. This work was partially supported by the National Natural Science Foundation of China (Grant no. 61402380), the Scientific Project of the State Ethnic Affairs Commission of the People's Republic of China (Grant no. 14GZZ012), the Science and Technology Foundation of Guizhou (Grant no. LH20147386), and the Fundamental Research Funds for the Central Universities (Grants nos. XDJK2015B030 and XDJK2015D029).

References

[1] Y. Qi and S. Ishak, "Stochastic approach for short-term freeway traffic prediction during peak periods," IEEE Transactions on Intelligent Transportation Systems, vol. 14, no. 2, pp. 660–672, 2013.

[2] A. de Palma and R. Lindsey, "Traffic congestion pricing methodologies and technologies," Transportation Research Part C: Emerging Technologies, vol. 19, no. 6, pp. 1377–1399, 2011.

[3] V. Marx, "The big challenges of big data," Nature, vol. 498, no. 7453, pp. 255–260, 2013.

[4] D. Agrawal, P. Bernstein, E. Bertino et al., "Challenges and opportunities with big data: a community white paper developed by leading researchers across the United States," White Paper, 2012.

[5] "Special online collection: dealing with data," Science, vol. 331, no. 6018, pp. 639–806, 2011.

[6] "Big data: science in the petabyte era," Nature, vol. 455, no. 7209, pp. 1–136, 2008.

[7] R. R. Weiss and L. Zgorski, Obama Administration Unveils 'Big Data' Initiative: Announces $200 Million in New R&D Investments, Office of Science and Technology Policy, Executive Office of the President, 2012.

[8] J. Manyika, M. Chui, B. Brown et al., "Big data: the next frontier for innovation, competition, and productivity," Tech. Rep., McKinsey Global Institute, 2011.

[9] Y. Genovese and S. Prentice, "Pattern-based strategy: getting value from big data," Gartner Special Report, 2011.

[10] N. J. Yuan, Y. Zheng, L. Zhang, and X. Xie, "T-finder: a recommender system for finding passengers and vacant taxis," IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 10, pp. 2390–2403, 2013.

[11] J. Yuan, Y. Zheng, C. Zhang et al., "T-drive: driving directions based on taxi trajectories," in Proceedings of the 18th International Conference on Advances in Geographic Information Systems (ACM SIGSPATIAL GIS '10), pp. 99–108, ACM, San Jose, Calif, USA, November 2010.

[12] J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, Waltham, Mass, USA, 3rd edition, 2011.

[13] D. Pelleg and A. Moore, "X-means: extending K-means with efficient estimation of the number of clusters," in Proceedings of the 17th International Conference on Machine Learning (ICML '00), pp. 727–734, Stanford, Calif, USA, June 2000.

[14] S. Kantabutra and A. L. Couch, "Parallel K-means clustering algorithm on NOWs," NECTEC Technical Journal, vol. 1, no. 6, pp. 243–247, 2000.

[15] Y. Zhang, Z. Xiong, J. Mao, and L. Ou, "The study of parallel K-means algorithm," in Proceedings of the 6th World Congress on Intelligent Control and Automation (WCICA '06), pp. 5868–5871, IEEE, Dalian, China, June 2006.

[16] P. Kraj, A. Sharma, N. Garge, R. Podolsky, and R. A. McIndoe, "ParaKMeans: implementation of a parallelized K-means algorithm suitable for general laboratory use," BMC Bioinformatics, vol. 9, article 200, 13 pages, 2008.

[17] M. K. Pakhira, "Clustering large databases in distributed environment," in Proceedings of the IEEE International Advance Computing Conference (IACC '09), pp. 351–358, Patiala, India, March 2009.

[18] K. J. Kohlhoff, V. S. Pande, and R. B. Altman, "K-means for parallel architectures using all-prefix-sum sorting and updating steps," IEEE Transactions on Parallel and Distributed Systems, vol. 24, no. 8, pp. 1602–1612, 2013.

[19] C.-T. Chu, S. K. Kim, Y.-A. Lin et al., "Map-reduce for machine learning on multicore," in Proceedings of the 20th Annual Conference on Neural Information Processing Systems (NIPS '06), pp. 281–288, Vancouver, Canada, 2006.

[20] W. Zhao, H. Ma, and Q. He, "Parallel K-means clustering based on MapReduce," in Proceedings of the 1st International Conference on Cloud Computing (CloudCom '09), pp. 674–679, Beijing, China, December 2009.

[21] P. Zhou, J. Lei, and W. Ye, "Large-scale data sets clustering based on MapReduce and Hadoop," Journal of Computational Information Systems, vol. 7, no. 16, pp. 5956–5963, 2011.

[22] C. D. Nguyen, D. T. Nguyen, and V.-H. Pham, "Parallel two-phase K-means," in Proceedings of the 13th International Conference on Computational Science and Its Applications (ICCSA '13), pp. 24–27, Ho Chi Minh City, Vietnam, June 2013.

[23] R. J. Walinchus, "Real-time network decomposition and subnetwork interfacing," Tech. Rep. HS-011 999, Highway Research Record, 1971.

[24] S. C. Wong, W. T. Wong, C. M. Leung, and C. O. Tong, "Group-based optimization of a time-dependent TRANSYT traffic model for area traffic control," Transportation Research Part B: Methodological, vol. 36, no. 4, pp. 291–312, 2002.

[25] D. I. Robertson and R. D. Bretherton, "Optimizing networks of traffic signals in real time: the SCOOT method," IEEE Transactions on Vehicular Technology, vol. 40, no. 1, pp. 11–15, 1991.

[26] Y.-Y. Ma and X.-G. Yang, "Traffic sub-area division expert system for urban traffic control," in Proceedings of the International Conference on Intelligent Computation Technology and Automation (ICICTA '08), pp. 589–593, Hunan, China, October 2008.

[27] K. Lu, J.-M. Xu, S.-J. Zheng, and S.-M. Wang, "Research on fast dynamic division method of coordinated control subarea," Acta Automatica Sinica, vol. 38, no. 2, pp. 279–287, 2012.

[28] K. Lu, J.-M. Xu, and S.-J. Zheng, "Correlation degree analysis of neighboring intersections and its application," Journal of South China University of Technology, vol. 37, no. 11, pp. 37–42, 2009.

[29] H. Guo, J. Cheng, Q. Peng, C. Zhu, and Y. Mu, "Dynamic division of traffic control sub-area methods based on the similarity of adjacent intersections," in Proceedings of the IEEE 17th International Conference on Intelligent Transportation Systems (ITSC '14), pp. 2208–2213, Qingdao, China, October 2014.

[30] C. Li, Y. Xie, H. Zhang, and X.-L. Yan, "Dynamic division about traffic control sub-area based on back propagation neural network," in Proceedings of the 2nd International Conference on Intelligent Human-Machine Systems and Cybernetics (IHMSC '10), pp. 22–25, Nanjing, China, August 2010.

[31] Z. Zhou, S. Lin, and Y. Xi, "A fast network partition method for large-scale urban traffic networks," Journal of Control Theory and Applications, vol. 11, no. 3, pp. 359–366, 2013.

[32] J. MacQueen, "Some methods for classification and analysis of multivariate observations," in Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, pp. 281–297, University of California Press, Berkeley, Calif, USA, 1967.

[33] A. K. Jain, "Data clustering: 50 years beyond K-means," Pattern Recognition Letters, vol. 31, no. 8, pp. 651–666, 2010.

[34] X. Wu, V. Kumar, J. R. Quinlan et al., "Top 10 algorithms in data mining," Knowledge and Information Systems, vol. 14, no. 1, pp. 1–37, 2008.

[35] W. Zou, Y. Zhu, H. Chen, and X. Sui, "A clustering approach using cooperative artificial bee colony algorithm," Discrete Dynamics in Nature and Society, vol. 2010, Article ID 459796, 16 pages, 2010.

[36] D. T. Pham, S. S. Dimov, and C. D. Nguyen, "An incremental K-means algorithm," Proceedings of the Institution of Mechanical Engineers, Part C: Journal of Mechanical Engineering Science, vol. 218, no. 7, pp. 783–794, 2004.

[37] D. T. Pham, S. S. Dimov, and C. D. Nguyen, "A two-phase K-means algorithm for large datasets," Journal of Mechanical Engineering Science, vol. 218, no. 10, pp. 1269–1273, 2004.

[38] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, "The Hadoop distributed file system," in Proceedings of the 26th Symposium on Mass Storage Systems and Technologies (MSST '10), pp. 1–10, IEEE, Incline Village, Nev, USA, May 2010.

[39] A. Ene, S. Im, and B. Moseley, "Fast clustering using MapReduce," in Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '11), pp. 681–689, San Diego, Calif, USA, August 2011.

[40] T. White, Hadoop: The Definitive Guide, O'Reilly Media, Sebastopol, Calif, USA, 3rd edition, 2012.

[41] D. Xia, Z. Rong, Y. Zhou, Y. Li, Y. Shen, and Z. Zhang, "A novel parallel algorithm for frequent itemsets mining in massive small files datasets," ICIC Express Letters, Part B: Applications, vol. 5, no. 2, pp. 459–466, 2014.

[42] D. Xia, Z. Rong, Y. Zhou, B. Wang, Y. Li, and Z. Zhang, "Discovery and analysis of usage data based on Hadoop for personalized information access," in Proceedings of the IEEE 16th International Conference on Computational Science and Engineering—Big Data Science and Engineering (CSE-BDSE '13), pp. 917–924, IEEE, Sydney, Australia, December 2013.

[43] D. Xia, B. Wang, Z. Rong, Y. Li, and Z. Zhang, "Effective methods and strategies for massive small files processing based on Hadoop," ICIC Express Letters, vol. 8, no. 7, pp. 1935–1941, 2014.

[44] S. Ghemawat, H. Gobioff, and S.-T. Leung, "The Google file system," in Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP '03), pp. 29–43, Bolton Landing, NY, USA, October 2003.

[45] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.

[46] P. Zikopoulos, C. Eaton, D. deRoos, T. Deutsch, and G. Lapis, Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data, McGraw-Hill, New York, NY, USA, 2011.

[47] W. K. D. Pun and A. B. M. S. Ali, "Unique distance measure approach for K-means (UDMA-Km) clustering algorithm," in Proceedings of the IEEE Region 10 Conference (TENCON '07), pp. 1–4, IEEE, Taipei, Taiwan, November 2007.

[48] A. M. Fahim, A. M. Salem, F. A. Torkey, and M. A. Ramadan, "An efficient enhanced K-means clustering algorithm," Journal of Zhejiang University SCIENCE A, vol. 7, no. 10, pp. 1626–1633, 2006.

[49] M. Zhu, Data Mining, University of Science and Technology of China Press, 2002.

[50] L. Kaufman and P. J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, John Wiley & Sons, 1990.

[51] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proceedings of the 26th Annual Conference on Neural Information Processing Systems (NIPS '12), pp. 1097–1105, Lake Tahoe, Nev, USA, December 2012.

[52] S. Englert, J. Gray, T. Kocher, and P. Shah, "A benchmark of NonStop SQL release 2 demonstrating near-linear speedup and scaleup on large databases," ACM SIGMETRICS Performance Evaluation Review, vol. 18, no. 1, pp. 245–246, 1990.

[53] X. Xu, J. Jäger, and H.-P. Kriegel, "A fast parallel clustering algorithm for large spatial databases," Data Mining and Knowledge Discovery, vol. 3, no. 3, pp. 263–290, 1999.



Figure 2: Architectures of HDFS and MapReduce.

and saving the limited bandwidth available on a Hadoop cluster.

2.3. MapReduce Framework. Hadoop Distributed File System (HDFS) [38] and Hadoop MapReduce [39, 40] are the two core components of Hadoop [41–43], based on the open-source implementations of GFS [44] and MapReduce [45], and their architectures are depicted in Figure 2. For further details on Hadoop, see the Apache Hadoop website (http://hadoop.apache.org).

MapReduce is a parallel processing paradigm that allows for massive scalability across hundreds or thousands of servers on a Hadoop cluster [46] and, in particular, provides an efficient computing framework for dealing with big taxi trajectory data for traffic subarea division. As a typical methodology, the processing of a MapReduce job includes the Map phase and the Reduce phase. Each phase has key-value pairs as input and output, the types of which may be selected by the programmer, who specifies the Map function and the Reduce function [40]. A simple example is illustrated in Figure 3, which shows the logical data flow of a simple MapReduce job that calculates the maximum temperature for each year by mining huge amounts of weather data sets.
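That logical data flow can be imitated in a few lines of plain Python (a toy simulation of the map, shuffle, and reduce stages, not actual Hadoop code; the comma-separated record format is invented for illustration):

```python
from collections import defaultdict

def map_fn(line):
    """Map: parse one weather record into a (year, temperature) pair."""
    year, temp = line.split(",")
    return int(year), int(temp)

def shuffle(pairs):
    """Shuffle/sort: group all temperatures by year."""
    groups = defaultdict(list)
    for year, temp in pairs:
        groups[year].append(temp)
    return groups

def reduce_fn(year, temps):
    """Reduce: emit the maximum temperature observed for the year."""
    return year, max(temps)

records = ["2013,12", "2013,46", "2014,-11", "2014,48"]
result = dict(reduce_fn(y, t) for y, t in shuffle(map(map_fn, records)).items())
# result == {2013: 46, 2014: 48}, matching the example values in Figure 3
```

The same three-stage shape, with different map and reduce bodies, underlies the clustering job described in this paper.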

In this work, based on a MapReduce framework, the time-consuming iterations of the proposed Par3PKM algorithm are performed in three phases with the Map function, the Combiner function, and the Reduce function, and the parallel computing process of MapReduce is shown in Figure 4. Specifically, in Par3PKM, the incremental Combiner function is executed between the Map tasks and the Reduce tasks, which can reduce the computational complexity of MapReduce jobs and save the limited bandwidth available on a Hadoop cluster.

3. Motivation

In this section, we describe the motivation of this work and give a reasonable solution to the traffic subarea division problem.

Naturally, the GPS-equipped taxi is an essential public traffic tool in modern cities. Over the last decade, taxi trajectory data have been exploding and have become the most important mobile trajectory data source, with the advantages of broad coverage, extra precision, excellent continuity, little privacy concern, and so forth. Thus, for solving the traffic subarea division problem, how to substantially improve the capacity for processing large-scale GPS trajectories of taxicabs poses an urgent challenge to us. On the other hand, as one of the most well-known clustering techniques in practice, the traditional K-Means algorithm is a simple iterative approach but has very high time complexity and memory consumption. The time requirements for K-Means are basically linear in the number of data objects: the time required is O(K * I * n * m), where K ≤ n, I ≤ n, K denotes the number of desired clusters, I is the number of iterations required for convergence, n is the total number of objects, and m represents the number of attributes in the given data sets. In particular, the storage required is O((K + n) * m). Obviously, the efficiency of the serial K-Means algorithm is low in handling a large number of taxi trajectories with limited memory on a single machine. Hence, to enhance the accuracy and efficiency of clustering, another challenge is how to optimize K-Means and implement it in a MapReduce framework on a Hadoop platform.
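The stated costs can be read off a bare-bones serial K-Means loop (our own sketch, not the paper's implementation): the assignment step compares all n objects against K centers over m attributes, repeated for at most I iterations, giving O(K * I * n * m) time, while storing the objects and centers takes O((K + n) * m) space.

```python
def serial_kmeans(objects, centroids, max_iters=100):
    """Plain serial K-Means: O(K*I*n*m) time, O((K+n)*m) storage."""
    for _ in range(max_iters):                          # I iterations
        clusters = [[] for _ in centroids]
        for obj in objects:                             # n objects
            d = [sum((a - b) ** 2 for a, b in zip(obj, c))  # m attributes
                 for c in centroids]                    # K centers
            clusters[d.index(min(d))].append(obj)
        new_centroids = [
            tuple(sum(vals) / len(c) for vals in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)]
        if new_centroids == centroids:                  # converged
            return new_centroids
        centroids = new_centroids
    return centroids
```

Every pass over the data reads all n * m coordinates K times, which is exactly the bottleneck the parallel redesign in Section 4 distributes across Map tasks.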

These challenges motivate the development of the Par3PKM algorithm with the DTSAD method, and a new solution to the above problems is illustrated in Figure 5. As shown in Figure 5, based on a Hadoop platform with MapReduce and ArcGIS, the process of the DTSAD method mainly includes the following steps. First, we preprocess correlation data extracted from large-scale taxi trajectories. Then, we cluster huge amounts of trajectory data in parallel using the proposed Par3PKM algorithm, as described in Section 4. Finally, we identify the borders of the clustering results to build each traffic subarea by our boundary identifying method, as depicted in Section 5.3.

Clearly, the key to this solution is the accuracy and efficiency of clustering large-scale taxi trajectories, which determines the overall performance of traffic subarea division. Thus, the aim of this paper is to put forward an efficient parallel clustering algorithm (Par3PKM) with a MapReduce implementation.

4. The Proposed Par3PKM Algorithm

In this section, the Parallel Three-Phase K-Means (Par3PKM) algorithm is proposed for efficiently and accurately clustering a large number of GPS trajectories of taxicabs under a MapReduce framework on Hadoop.

4.1. Overview and Notation. The process of the Par3PKM algorithm is depicted in Figure 6. Based on a MapReduce framework, the Par3PKM algorithm first chooses K of the

Discrete Dynamics in Nature and Society 5

[Figure 3: Logical data flow of a simple MapReduce job (input → map → shuffle/sort → reduce → output).]

[Figure 4: Parallel computing process of MapReduce (input splits are read by Map workers forked by the user program and assigned by the master; intermediate results are written locally, read remotely by Reduce workers, and written to output files).]

[Figure 5: Process and framework of the DTSAD method (large-scale taxi trajectories → data preprocessing → parallel clustering → boundary identifying → traffic subarea, on Hadoop with MapReduce and ArcGIS).]

objects as initial centroids, where K is the number of desired clusters. Then each of the remaining objects is assigned

[Figure 6: Process of the Par3PKM algorithm (input → select K → initialize centroids → parallel execution of iteration: compute distances, assign each object to its closest centroid, recompute centroids → output when the criterion function converges).]

to the cluster to which it is the most similar, on the basis of the distance between the object and the cluster center. Finally, it computes the new center for each cluster, and this


[Figure 7: Parallel execution process of iteration (Map function → Combiner function → Reduce function, updating the centroids and iterating until the criterion function converges).]

process iterates until the criterion function converges. The parallel execution of iteration is illustrated in Figure 7, and its MapReduce implementation will be described in Section 4.3 in detail.

Typically, the squared-error criterion [33] is used to measure the quality of clustering. Let X = {x_i | i = 1, ..., n} be the set of n m-dimensional vectors to be clustered into a set of K clusters C = {c_k | k = 1, ..., K}. Let μ_k be the mean of cluster c_k.

The squared-error criterion between μ_k and the given objects in cluster c_k is defined as

    J(c_k) = Σ_{x_i ∈ c_k} ‖x_i − μ_k‖².    (1)

The goal of the Par3PKM algorithm is to minimize the sum of the squared errors (SSE) over all the K clusters, and SSE is given by the following equation:

    SSE = J(C) = Σ_{k=1}^{K} Σ_{x_i ∈ c_k} ‖x_i − μ_k‖².    (2)
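As a plain sketch of criterion (2), SSE can be computed directly from the centroids and their assigned members (the helper name below is ours, not from the paper):

```python
def sse(clusters):
    """Sum of squared errors, eq. (2): `clusters` is a list of
    (centroid, members) pairs, each point an m-dimensional tuple."""
    total = 0.0
    for mu, members in clusters:
        for x in members:
            # squared Euclidean distance of x from its cluster mean
            total += sum((xj - mj) ** 2 for xj, mj in zip(x, mu))
    return total
```

Par3PKM stops iterating when this quantity stops decreasing, which is the "criterion function converges" test in Figure 6.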

4.2. Distance Measure and Cluster Initialization. It is well known that the K-Means algorithm is sensitive to the distance measure and cluster initialization. To overcome these critical limitations, we optimize the distance metric and initialization strategy for improving the accuracy and efficiency of clustering in the proposed Par3PKM algorithm.

4.2.1. Distance Metric. The traditional K-Means algorithm typically employs the Euclidean metric to compute the distance between objects and cluster centers. In the proposed Par3PKM algorithm, we instead adopt two rules based on a statistical approach [47] for distance measure selection, which is more appropriate for large-scale trajectory data sets.

The two rules of distance measure selection are as follows:

(i) If κ ≤ 8.46, squared Euclidean distance (see (5)) is chosen as the distance measure of Par3PKM.

(ii) If κ > 8.46, Manhattan distance (see (6)) is selected as the distance measure of Par3PKM.

Here, κ is the kurtosis, which measures the tail heaviness. It is defined as

    κ = [(1/n) Σ_{i=1}^{n} (x_i − X̄)⁴] / σ⁴,    (3)

where n represents the sequence length, and the sample mean X̄ and sample standard deviation σ are given by

    X̄ = (1/n) Σ_{i=1}^{n} x_i,

    σ = [(1/(n − 1)) Σ_{i=1}^{n} (x_i − X̄)²]^{1/2}.    (4)

The squared Euclidean distance d²_e and the Manhattan distance d_m are, respectively, given by

    d²_e(x_i, μ_k) = Σ_{j=1}^{m} |x_{ij} − μ_{kj}|²,    (5)

    d_m(x_i, μ_k) = Σ_{j=1}^{m} |x_{ij} − μ_{kj}|,    (6)


where x_i = (x_{i1}, x_{i2}, ..., x_{im}) and μ_k = (μ_{k1}, μ_{k2}, ..., μ_{km}) are two m-dimensional data objects.

The Par3PKM algorithm achieves more accurate clustering performance than the K-Means algorithm via this statistics-based distance measure method, which will be evaluated in Section 6.
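A minimal sketch of this selection rule, combining eqs. (3)–(6), might look as follows. The threshold 8.46 is an assumption on our part (the extracted text loses decimal points and reads "846"), so it is kept as a parameter:

```python
import math

def kurtosis(xs):
    # eq. (3), using the sample mean and sample standard deviation of eq. (4)
    n = len(xs)
    mean = sum(xs) / n
    sigma = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))
    return (sum((x - mean) ** 4 for x in xs) / n) / sigma ** 4

def choose_distance(xs, threshold=8.46):
    """Rules (i)/(ii): squared Euclidean (eq. (5)) for light-tailed data,
    Manhattan (eq. (6)) for heavy-tailed data. `threshold` is assumed."""
    if kurtosis(xs) <= threshold:
        return lambda x, mu: sum((a - b) ** 2 for a, b in zip(x, mu))
    return lambda x, mu: sum(abs(a - b) for a, b in zip(x, mu))
```

The returned callable can then be plugged in wherever the Map phase computes object-to-centroid distances.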

4.2.2. Initialization Strategy. Different initialization strategies lead to different efficiencies. In our Par3PKM, three initialization strategies are developed to enhance the efficiency of clustering. To overcome the local minima of Par3PKM, for a given K with several different initial partitions, we select the partition with the smallest value of the squared error. Furthermore, inspired by [48], we take the mutually farthest data objects in the high-density area as the initial centroids for Par3PKM. The method of obtaining the initial centroids from the high-density area was introduced in [49]. In addition, both pre- and postprocessing on Par3PKM are taken into consideration; for example, we remove outliers in a preprocessing step, and we eliminate small clusters and/or merge close clusters into a large cluster when postprocessing the results.

From the experimental evaluations described in Section 6, we can observe that the implemented initialization strategies shorten the execution time of Par3PKM while producing the same clustering results.
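The methods of [48, 49] are not reproduced here; purely as an illustrative sketch, one hedged reading of "mutually farthest objects in the high-density area" is to rank points by a crude neighbourhood density, keep the denser ones, and then greedily pick points farthest from the centroids already chosen (all names and the density estimate below are ours):

```python
def init_centroids(points, k, radius):
    """Illustrative sketch only: density filter + greedy farthest-point picking."""
    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    def density(p):
        # neighbour count within `radius` as a stand-in density estimate
        return sum(1 for q in points if dist2(p, q) <= radius ** 2)

    dens = [density(p) for p in points]
    median = sorted(dens)[len(points) // 2]
    # keep only points at least as dense as the median
    dense = [p for p, d in zip(points, dens) if d >= median]

    # greedily add the point farthest from all centroids chosen so far
    centroids = [dense[0]]
    while len(centroids) < k:
        centroids.append(max(dense, key=lambda p: min(dist2(p, c) for c in centroids)))
    return centroids
```

This is quadratic in the number of points, so in practice it would only be run on a sample of the trajectory data.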

4.3. MapReduce Implementation. To improve the efficiency and scalability of clustering, we implement the Par3PKM algorithm in the MapReduce model of computation. The tasks of Par3PKM are mainly composed of the following three aspects:

(i) Compute the distance between each object and each cluster center.

(ii) Assign each object to its closest centroid.

(iii) Recompute the new cluster center for each cluster.

In this work, we accomplish the aforementioned tasks on a Hadoop platform using MapReduce. First, we select K data objects (vectors) as the initial cluster centers from the input data sets, which are stored in HDFS. Then we update the centroid of each cluster after each iteration. As depicted in Figure 7, the parallel execution of iteration is implemented by the Map function, the Combiner function, and the Reduce function in three phases (i.e., the Map phase, Combine phase, and Reduce phase), respectively.

4.3.1. Map Phase. In this phase, the Map task receives each line in the sequence file as a different key-value pair, which forms the input to the Map function. The Map function first computes the distance between each data object and each cluster center, then assigns each object to its closest centroid according to the shortest distance, and finally outputs the intermediate data to the Combiner function. The Map function is formally depicted in Algorithm 1.

4.3.2. Combine Phase. In this phase, the Combiner function first extracts all the data objects from value1, which is the output of the Map function, and merges the data objects belonging to the same cluster center. Next, it calculates the sum of the values of the data objects assigned to the same cluster and records the number of samples in the same cluster, so as to compute the mean value of the data objects. Finally, it outputs the local results of each cluster to the Reduce function. The Combiner function is formally described in Algorithm 2.

4.3.3. Reduce Phase. In this phase, the Reduce function extracts all the data objects from value2, which is the output of the Combiner function, and aggregates the local results of all the clusters. Then it computes the new cluster center for each cluster. After that, it judges whether the criterion function converges. Finally, it outputs the final results if convergence is reached and executes the next iteration otherwise. The Reduce function is formally illustrated in Algorithm 3.

4.4. Complexity Analysis. MapReduce is a programming model for large-scale data processing, and MapReduce programs in particular are inherently parallel. Thus, the Par3PKM algorithm with MapReduce implementation distributes large numbers of computational tasks across different nodes. According to this parallel processing paradigm, the time complexity of Par3PKM is O(K × I × n × m)/(p × q), where p is the number of nodes, q is the number of Map tasks on one node, and K, I, n, and m are explained in Section 3. Moreover, the space complexity of Par3PKM is O((K + n) × m)/p. Only the data objects and the centroids are stored in each slave node, respectively; thus, the space requirements for Par3PKM are modest.

In comparison with the traditional K-Means algorithm, the Par3PKM algorithm improves the efficiency of clustering by the following rate:

    O_imp = (1 − 1/(p × q)) × 100%.    (7)

Based on the above analyses, we can conclude that the Par3PKM algorithm is relatively scalable and efficient in clustering large-scale data sets, with the desired computational complexity.

5. Case Study

In this section, we apply the proposed approach to divide the traffic subareas of Beijing using large-scale taxi trajectories and then analyze the results.

5.1. Data Sets and Framework. In this work, we divide the traffic subareas of Beijing based on real-world trajectory data sets (http://www.datatang.com), which contain a large number of GPS trajectories recorded by 12,000 taxicabs during a period of 30 days in November 2012. The total distance of the data sets is more than 50 million kilometers, and the total size is 50 GB. In particular, the total number of GPS points reaches 969 million.

Additionally, we perform the case study on a Hadoop cluster with MapReduce (as described in Section 6.1) and ArcGIS, using the road network of Beijing, which consists of 106,579 road nodes and 141,380 road segments.


Input:
    key: the offset
    value: the sample
    centroids: the global variable
Output: ⟨key1, value1⟩
    key1: the index of the closest centroid
    value1: the information of the sample

(1) Construct a global variable centroids including the information of the closest centroid point
(2) Construct the sample examples to extract the data objects from value
(3) min_Dis = Double.MAX_VALUE
(4) index = −1
(5) for i = 0 to centroids.length do
(6)   distance = DistanceFunction(examples, centroids[i])
(7)   if distance < min_Dis then
(8)     min_Dis = distance
(9)     index = i
(10)  end if
(11) end for
(12) Take index as key1
(13) Construct value1 as a string consisting of the values of different dimensions
(14) return ⟨key1, value1⟩ pairs

Algorithm 1: Map(key, value)

Input:
    key1: the index of the cluster
    medi: the list of the samples assigned to the same cluster
Output: ⟨key2, value2⟩
    key2: the index of the cluster
    value2: the sum of the values of the samples belonging to the same cluster, and the number of samples

(1) Construct a counter num_s to record the number of samples in the same cluster
(2) Construct an array sum_v to record the sum of the values of the different dimensions of the samples belonging to the same cluster (i.e., the samples in the list medi)
(3) Construct the sample examples to extract the data objects from medi.next(), and the dimensions to obtain the dimension of the original data object
(4) num_s = 0
(5) while (medi.hasNext()) do
(6)   CurrentPoint = medi.next()
(7)   num_s++
(8)   for i = 0 to dimensions do
(9)     sum_v[i] += CurrentPoint.point[i]
(10)    Calculate the sum of the values of each dimension of examples
(11)  end for
(12)  for i = 0 to dimensions do
(13)    mean[i] = sum_v[i]/num_s
(14)    Compute the mean value of the samples for each cluster
(15)  end for
(16) end while
(17) Take index as key2
(18) Construct value2 as a string containing the sum of the values of each dimension sum_v[i] and the number of samples num_s
(19) return ⟨key2, value2⟩ pairs

Algorithm 2: Combiner(key1, medi)


Input:
    key2: the index of the cluster
    medi: the list of the local sums from different clusters
Output: ⟨key3, value3⟩
    key3: the index of the cluster
    value3: the new cluster center

(1) Construct a counter Num to record the total number of samples belonging to the same cluster
(2) Construct an array sum_v to record the sum of the values of the different dimensions of the samples in the same cluster (i.e., the samples in the list medi)
(3) Construct the sample examples to extract the data objects from medi.next(), and the dimensions to obtain the dimension of the original data object
(4) Num = 0
(5) while (medi.hasNext()) do
(6)   CurrentPoint = medi.next()
(7)   Num += num_s
(8)   for i = 0 to dimensions do
(9)     sum_v[i] += CurrentPoint.point[i]
(10)  end for
(11)  for i = 0 to dimensions do
(12)    mean[i] = sum_v[i]/Num
(13)    Obtain the new cluster center
(14)  end for
(15) end while
(16) Take index as key3
(17) Construct value3 as a string composed of the new cluster center
(18) return ⟨key3, value3⟩ pairs

Algorithm 3: Reduce(key2, medi)

[Figure 8: Clustering results of large-scale taxi trajectories (Area A highlighted).]

5.2. Parallel Clustering. After data preprocessing with the DTSAD method, we extract the relevant attributes (e.g., longitude and latitude) of the GPS trajectory records where passengers pick up or drop off taxis from the aforementioned data sets. Then, on a Hadoop cluster with MapReduce, we cluster the extracted trajectory data sets through the Par3PKM algorithm (as depicted in Section 4) with K = 100, which is the number of desired clusters. Finally, based on the ArcGIS platform with the road network of Beijing, we plot the results in Figure 8.

As illustrated in Figure 8, large-scale taxi trajectories are clustered into different areas, and each area with


[Figure 9: Process of the boundary identifying method: (a) division of the coordinate system into parts of 360°/n, (b) selection of border points (e.g., P), and (c) connection of border points.]

a different color represents a cluster. Each cluster (e.g., Area A) has obvious characteristics of traffic condition, such as the flow of people and automobiles, which is high in these areas in comparison with the real traffic map and traffic conditions of Beijing.

5.3. Boundary Identifying. On the ArcGIS platform, it is difficult to identify the borders of each cluster. However, we have to connect the borders of each cluster in order to accurately form traffic subareas; we do so via our boundary identifying method, which is described in Figure 9. As illustrated in Figure 9, the boundary identifying method is mainly composed of the following three steps.

Step 1. As shown in Figure 9(a), we build a coordinate system that is equally divided into n parts, taking (0, 0) as the origin of coordinates.

Step 2. We match each cluster center to the origin of coordinates and then map the other points of the same cluster into the coordinate system. Finally, the farthest point of each part is selected (e.g., P in Figure 9(b)).

Step 3. As depicted in Figure 9(c), we connect these selected points of each part and then obtain a subarea.
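The three steps above can be sketched as follows. This is a minimal illustration; the function and parameter names are ours, and the paper does not specify the value of n:

```python
import math

def subarea_boundary(center, members, n=8):
    """Divide the plane around the cluster centre into n equal angular
    parts (Step 1), keep the farthest member of each part (Step 2),
    and return the kept points in angular order for connection (Step 3)."""
    farthest = {}
    for p in members:
        dx, dy = p[0] - center[0], p[1] - center[1]
        # angular part index in [0, n), guarding against float rounding
        part = int((math.atan2(dy, dx) % (2 * math.pi)) / (2 * math.pi / n)) % n
        d2 = dx * dx + dy * dy
        if part not in farthest or d2 > farthest[part][0]:
            farthest[part] = (d2, p)
    return [p for _, (_, p) in sorted(farthest.items())]
```

Connecting consecutive points of the returned list (and the last back to the first) yields the subarea polygon.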

Using the presented boundary identifying method with the clustering results of Par3PKM, we plot the division results of traffic subareas in Figure 10. As described in Figure 10, Area A is a typical traffic subarea, shown in the lower right corner of the graph, which includes Tsinghua University, Peking University, Renmin University of China, Beihang University, and Zhongguancun. Area B is composed of the Beijing Workers' Gymnasium, Blue Zoo Beijing, Children Outdoor Paopao Paradise, and so forth. Area C consists of Beijing North Railway Station, Beijing Exhibition Center, Capital Indoor Stadium, and so forth. Area D contains the Olympic Sports Center Stadium, Beijing Olympic Park, National Indoor Stadium, Beijing National Aquatics Center, Beijing International Convention Center, and so forth.

5.4. Analysis of Results. According to the division results, we can observe that some areas with similar traffic conditions are divided into the same traffic subarea, such as Tsinghua University and Peking University, or the Olympic Sports Center Stadium and the National Indoor Stadium. In contrast, Blue Zoo Beijing and the Beijing Exhibition Center, as well as Beijing North Railway Station and the Beijing International Convention Center, are classified into different traffic subareas. This is because the different regions of a traffic subarea have great similarities and correlations in traffic condition, business pattern, and other aspects.

Based on the above analysis, we conclude that the proposed Par3PKM algorithm can efficiently cluster big trajectory data on a Hadoop cluster using MapReduce. Moreover, our boundary identifying method can accurately connect the borders of clustering results for each cluster. In particular, the division results are consistent with the real traffic conditions of the corresponding areas in Beijing. Overall, the results demonstrate that Par3PKM combined with DTSAD is a promising alternative for traffic subarea division with large-scale taxi trajectories and thus can reduce the complexity of traffic planning, management, and analysis. More importantly, it can provide helpful decision-making for building ITSs.

6. Evaluation and Discussion

In this section, we evaluate the accuracy, efficiency, speedup, scale-up, and reliability of the Par3PKM algorithm via extensive experiments on real and synthetic data and then discuss the performance results.

6.1. Evaluation Setup. The experimental platform is based on a Hadoop cluster composed of one master machine and eight slave machines, each with an Intel Xeon E7-4820 2.00 GHz CPU (4-core) and 8.00 GB RAM. All the experiments are performed on Ubuntu 12.04 OS with Hadoop 1.0.4 and JDK 1.6.0.


[Figure 10: Division results of traffic subarea, showing Areas A, B, C, and D.]

Table 1: Data sets of the experimental evaluations.

Name                  Number of instances   Number of attributes
Iris                  150                   4
Haberman's Survival   306                   3
Ecoli                 336                   8
Hayes-Roth            160                   5
Lenses                24                    4
Wine                  178                   13

In addition to a real taxi trajectory data set (as described in Section 5.1), we use six synthetic data sets (shown in Table 1) selected from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets.html) to evaluate the performance of the Par3PKM algorithm in comparison with K-Means, Par2PK-Means, and ParCLARA, the latter being the best-known K-medoid algorithm, CLARA (Clustering Large Applications) [50], with a MapReduce implementation. Meanwhile, each data set is processed into 80 MB, 160 MB, 320 MB, 640 MB, 1280 MB, and 2560 MB versions so as to further verify the efficiency of the proposed algorithm. Also, we process the seven data sets into 160 MB, 320 MB, 480 MB, 640 MB, 800 MB, 960 MB, 1120 MB, and 1280 MB versions for validating the scale-up of our algorithm.

6.2. Evaluation on Efficiency. We perform efficiency experiments in which we execute Par3PKM, Par2PK-Means, and ParCLARA in the parallel environment with eight nodes, and K-Means in the single-machine environment, using seven data sets with different sizes (varying from 80 MB to 2560 MB), in order to demonstrate whether the Par3PKM algorithm can process larger data sets with higher efficiency. The experimental results are respectively shown in Table 2 and Figure 11.

As depicted in Figure 11, the K-Means algorithm cannot process data sets over 1280 MB in the single-machine environment on account of memory overflow, and thus the graph does not present the corresponding execution time of K-Means in this experiment. However, the Par3PKM algorithm can effectively handle more than 1280 MB of data, and even larger data, in the parallel environment (i.e., in a "Big Data" environment). In particular, the Par3PKM algorithm has higher efficiency than the ParCLARA algorithm and the Par2PK-Means algorithm, owing to the improvement and parallelization of the K-Means algorithm, such as the addition of the Combiner function in the Combine phase. To reduce the computational cost of a MapReduce job and save the limited bandwidth available on a Hadoop cluster, the Combiner function of the Par3PKM algorithm is employed to cut down the amount of data shuffled between the Map tasks and the Reduce tasks; it is specified to run on the Map output, and its output forms the input to the Reduce function.
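A back-of-the-envelope way to see the combiner's benefit (hypothetical helper; record sizes and skew are ignored): without a combiner, the Map phase shuffles one record per object, while with one it shuffles at most K partial sums per split:

```python
def shuffled_records(n_objects, k, n_splits, use_combiner):
    # Upper bound on the number of records crossing the network per iteration
    return min(n_objects, k * n_splits) if use_combiner else n_objects
```

For instance, with the 969 million GPS points of Section 5.1, K = 100, and an assumed 1,000 input splits, the shuffle drops from 969,000,000 records to at most 100,000.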

At the same time, we can find that the execution time of the K-Means algorithm is shorter than that of the Par3PKM algorithm when clustering small-scale data sets. The reason is that the communication and interaction of the nodes consume a certain amount of time in the parallel environment (e.g., the start of Job and Task tasks, and the communication between the NameNode and DataNodes), thereby making the execution time much longer than the actual computation time of the Par3PKM algorithm. More importantly, we can also observe that the efficiency advantage of the Par3PKM algorithm multiplies, and its superiority becomes more marked, as the sizes of the data sets gradually increase.


Table 2: Execution time comparison on seven data sets.

                                  Execution time (s)
Data sets             Size (MB)   K-Means   ParCLARA   Par2PK-Means   Par3PKM
Taxi Trajectory       80          75        662        439            312
                      160         103       756        486            371
                      320         195       893        579            406
                      640         1125      1173       838            614
                      1280        2373      1679       1348           1026
                      2560        —         2721       2312           1879
Iris                  80          32        612        301            213
                      160         54        693        380            279
                      320         116       812        440            306
                      640         685       1045       630            403
                      1280        1800      1248       1005           768
                      2560        —         2463       2013           1420
Haberman's Survival   80          56        576        311            296
                      160         60        675        400            324
                      320         130       823        470            378
                      640         720       987        670            426
                      1280        2010      1321       1200           873
                      2560        —         2449       2200           1719
Ecoli                 80          38        628        330            283
                      160         66        712        400            324
                      320         130       835        460            375
                      640         756       1104       700            482
                      1280        1912      1636       1234           763
                      2560        —         2479       2312           1416
Hayes-Roth            80          41        568        310            278
                      160         57        643        395            347
                      320         125       726        460            387
                      640         715       973        660            438
                      1280        1980      1479       1211           736
                      2560        —         2423       2120           1567
Lenses                80          32        624        309            297
                      160         59        701        389            327
                      320         130       924        452            376
                      640         700       1072       545            432
                      1280        1895      1378       1085           814
                      2560        —         2379       2089           1473
Wine                  80          50        635        350            317
                      160         78        705        420            356
                      320         130       835        470            402
                      640         730       1006       610            426
                      1280        2100      1346       1245           843
                      2560        —         2463       2240           1645

6.3. Evaluation on Accuracy. To evaluate the accuracy, we cluster different data sets via Par3PKM, Par2PK-Means, and ParCLARA on a Hadoop cluster with eight nodes, and via K-Means in the single-machine environment, respectively. We then plot the results in Figure 12(a).

The quality of each algorithm is evaluated via the following error rate (ER) [51] equation:

    ER = (O_m / O_t) × 100%,    (8)


[Figure 11: Efficiency comparison on different data sets: (a) Taxi Trajectory, (b) Iris, (c) Haberman's Survival, (d) Ecoli, (e) Hayes-Roth, (f) Lenses, and (g) Wine. Each panel plots execution time (s) against the size of data sets (MB) for K-Means, ParCLARA, Par2PK-Means, and Par3PKM.]

[Figure 12: Accuracy and reliability: (a) accuracy comparison (error rate, %) of K-Means, Par3PKM, Par2PK-Means, and ParCLARA on different data sets; (b) reliability of Par3PKM (execution time vs. number of faulty nodes).]

where O_m is the number of misclassified objects and O_t is the total number of objects. The lower the ER, the better the clustering.
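The paper does not spell out how O_m is counted; a common convention (assumed here) maps each cluster to its majority true label and counts every other object in that cluster as misclassified:

```python
from collections import Counter

def error_rate(cluster_ids, true_labels):
    # eq. (8): ER = O_m / O_t * 100, with O_m from majority-label mapping
    by_cluster = {}
    for c, t in zip(cluster_ids, true_labels):
        by_cluster.setdefault(c, []).append(t)
    # objects not carrying their cluster's majority label are misclassified
    o_m = sum(len(ts) - Counter(ts).most_common(1)[0][1]
              for ts in by_cluster.values())
    return o_m / len(true_labels) * 100
```

This convention is only applicable to the labeled UCI data sets of Table 1, not to the unlabeled taxi trajectories.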

As illustrated in Figure 12(a), in comparison with the other algorithms, the Par3PKM algorithm produces more accurate clustering results in most cases. The results indicate that the Par3PKM algorithm is valid and feasible.

6.4. Evaluation on Speedup. In order to evaluate the speedup of the Par3PKM algorithm, we keep each of the seven data sets constant (1280 MB) and increase the number of nodes (ranging from 1 node to 8 nodes) on a Hadoop cluster, and then plot the results in Figure 13(a). Moreover, we utilize the Iris data set (1280 MB) to further verify the speedup of Par3PKM in comparison with Par2PK-Means and ParCLARA, and the results are illustrated in Figure 13(b).

The speedup metric [52, 53] is defined as

    Speedup = T_s / T_p,    (9)

where T_s represents the execution time of an algorithm on one node (i.e., the sequential execution time) for clustering


[Figure 13: Speedup: (a) speedup comparison of Par3PKM on the different data sets (Taxi Trajectory, Iris, Haberman's Survival, Ecoli, Hayes-Roth, Lenses, and Wine) against linear speedup; (b) speedup comparison of Par3PKM, Par2PK-Means, and ParCLARA. Both panels plot speedup against the number of nodes (1 to 8).]

objects using the given data sets, and T_p denotes the execution time of an algorithm for solving the same problem using the same data sets on a Hadoop cluster with p nodes (i.e., the parallel execution time), respectively.

As depicted in Figure 13(a), the speedup of the Par3PKM algorithm increases relatively linearly with an increasing number of nodes. It is known that linear speedup is difficult to achieve because of the communication cost and the skew of the slaves. Furthermore, Figure 13(b) shows that Par3PKM has better speedup than Par2PK-Means and ParCLARA. The results demonstrate that the parallel algorithm Par3PKM has a very good speedup performance, which is almost the same for data sets with very different sizes.

6.5. Evaluation on Scale-Up. To evaluate how well the Par3PKM algorithm processes larger data sets when more nodes are available, we perform scale-up experiments in which we increase the size of the data sets (varying from 160 MB to 1280 MB) in direct proportion to the number of nodes (ranging from 1 node to 8 nodes), and then plot the results in Figure 14(a). Furthermore, with the Iris data sets (varying from 160 MB to 1280 MB), the scale-up comparison of Par3PKM, Par2PK-Means, and ParCLARA is depicted in Figure 14(b).

The scale-up metric [52, 53] is given by

    Scale-up = T_s / T_p,    (10)

where T_s is the execution time of an algorithm for processing the given data sets on one node, and T_p is the execution time of the algorithm for handling p-times larger data sets on p-times more nodes.

As illustrated in Figure 14(a), the scale-up values of Par3PKM remain in the vicinity of 1, or even less, with the proportional growth of both the number of nodes and the size of the data sets. Moreover, Figure 14(b) shows that Par3PKM has better scale-up than Par2PK-Means and ParCLARA. The results indicate that the Par3PKM algorithm has excellent scale-up and adaptability on large-scale data sets on a Hadoop cluster with MapReduce.

6.6. Evaluation on Reliability. To evaluate the reliability of the Par3PKM algorithm, we shut down several nodes (ranging from 1 node to 7 nodes) to test whether Par3PKM can execute normally and achieve the same clustering results on the Iris data set of 1280 MB, and then plot the results in Figure 12(b).

As illustrated in Figure 12(b), although the execution time of the Par3PKM algorithm increases gradually with the growth of the number of faulty nodes, the Par3PKM algorithm still executes normally and produces the same results. The results show that the Par3PKM algorithm has good reliability in a "Big Data" environment, owing to the high fault tolerance of the MapReduce framework on a Hadoop platform: when a node cannot execute tasks on a Hadoop cluster, the JobTracker automatically assigns the tasks of the faulty node(s) to other spare nodes. Conversely, when the single machine running the serial K-Means algorithm is faulty, the entire computational task fails.

In summary, extensive experiments are conducted on real and synthetic data, and the performance results demonstrate that the proposed Par3PKM algorithm is much more efficient and accurate, with better speedup, scale-up, and reliability.


[Figure 14: Scale-up: (a) scale-up comparison of Par3PKM on the different data sets (Taxi Trajectory, Iris, Haberman's Survival, Ecoli, Hayes-Roth, Lenses, and Wine); (b) scale-up comparison of Par3PKM, Par2PK-Means, and ParCLARA. Both panels plot scale-up against the number of nodes (1 to 8).]

7. Conclusions

In this paper, we have proposed an efficient MapReduce-based parallel clustering algorithm, named Par3PKM, to solve the traffic subarea division problem with large-scale taxi trajectories. In Par3PKM, the distance metric and initialization strategy of K-Means are optimized in order to enhance the accuracy of clustering. Then, to improve the efficiency and scalability of Par3PKM, the optimized K-Means algorithm is implemented in a MapReduce framework on Hadoop. The optimization and parallelism of Par3PKM save memory consumption and reduce the computational cost of big calculations, thereby significantly improving the accuracy, efficiency, scalability, and reliability of traffic subarea division. Our performance evaluation indicates that the proposed algorithm can efficiently cluster a large number of GPS trajectories of taxicabs and, in particular, achieves more accurate results than K-Means, Par2PK-Means, and ParCLARA, with favorably superior performance. Furthermore, based on Par3PKM, we have presented a distributed traffic subarea division method, named DTSAD, which is performed on a Hadoop distributed computing platform with the MapReduce parallel processing paradigm. In DTSAD, the boundary identifying method can effectively connect the borders of clustering results. Most importantly, we have divided the traffic subareas of Beijing using big real-world taxi trajectory data sets through the presented method, and our case study demonstrates that our approach can accurately and efficiently divide traffic subareas.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Authors' Contribution

Dawen Xia and Binfeng Wang contributed equally to this work.

Acknowledgments

The authors would like to thank the academic editor and the anonymous reviewers for their valuable comments and suggestions. This work was partially supported by the National Natural Science Foundation of China (Grant no. 61402380), the Scientific Project of the State Ethnic Affairs Commission of the People's Republic of China (Grant no. 14GZZ012), the Science and Technology Foundation of Guizhou (Grant no. LH20147386), and the Fundamental Research Funds for the Central Universities (Grants nos. XDJK2015B030 and XDJK2015D029).

References

[1] Y. Qi and S. Ishak, "Stochastic approach for short-term freeway traffic prediction during peak periods," IEEE Transactions on Intelligent Transportation Systems, vol. 14, no. 2, pp. 660-672, 2013.

[2] A. de Palma and R. Lindsey, "Traffic congestion pricing methodologies and technologies," Transportation Research Part C: Emerging Technologies, vol. 19, no. 6, pp. 1377-1399, 2011.

[3] V. Marx, "The big challenges of big data," Nature, vol. 498, no. 7453, pp. 255-260, 2013.

[4] D. Agrawal, P. Bernstein, E. Bertino et al., "Challenges and opportunities with big data: a community white paper developed by leading researchers across the United States," White Paper, 2012.

[5] "Special online collection: dealing with data," Science, vol. 331, no. 6018, pp. 639-806, 2011.

[6] "Big data: science in the petabyte era," Nature, vol. 455, no. 7209, pp. 1-136, 2008.

[7] R. R. Weiss and L. Zgorski, Obama Administration Unveils 'Big Data' Initiative: Announces $200 Million in New R&D Investments, Office of Science and Technology Policy, Executive Office of the President, 2012.

[8] J. Manyika, M. Chui, B. Brown et al., "Big data: the next frontier for innovation, competition, and productivity," Tech. Rep., McKinsey Global Institute, 2011.

[9] Y. Genovese and S. Prentice, "Pattern-based strategy: getting value from big data," Gartner Special Report, 2011.

[10] N. J. Yuan, Y. Zheng, L. Zhang, and X. Xie, "T-finder: a recommender system for finding passengers and vacant taxis," IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 10, pp. 2390-2403, 2013.

[11] J. Yuan, Y. Zheng, C. Zhang et al., "T-drive: driving directions based on taxi trajectories," in Proceedings of the 18th International Conference on Advances in Geographic Information Systems (ACM SIGSPATIAL GIS '10), pp. 99-108, ACM, San Jose, Calif, USA, November 2010.

[12] J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, Waltham, Mass, USA, 3rd edition, 2011.

[13] D. Pelleg and A. Moore, "X-means: extending K-means with efficient estimation of the number of clusters," in Proceedings of the 17th International Conference on Machine Learning (ICML '00), pp. 727-734, Stanford, Calif, USA, June 2000.

[14] S. Kantabutra and A. L. Couch, "Parallel K-means clustering algorithm on NOWs," NECTEC Technical Journal, vol. 1, no. 6, pp. 243-247, 2000.

[15] Y. Zhang, Z. Xiong, J. Mao, and L. Ou, "The study of parallel K-means algorithm," in Proceedings of the 6th World Congress on Intelligent Control and Automation (WCICA '06), pp. 5868-5871, IEEE, Dalian, China, June 2006.

[16] P. Kraj, A. Sharma, N. Garge, R. Podolsky, and R. A. McIndoe, "ParaKMeans: implementation of a parallelized K-means algorithm suitable for general laboratory use," BMC Bioinformatics, vol. 9, article 200, 13 pages, 2008.

[17] M. K. Pakhira, "Clustering large databases in distributed environment," in Proceedings of the IEEE International Advance Computing Conference (IACC '09), pp. 351-358, Patiala, India, March 2009.

[18] K. J. Kohlhoff, V. S. Pande, and R. B. Altman, "K-means for parallel architectures using all-prefix-sum sorting and updating steps," IEEE Transactions on Parallel and Distributed Systems, vol. 24, no. 8, pp. 1602-1612, 2013.

[19] C.-T. Chu, S. K. Kim, Y.-A. Lin et al., "Map-reduce for machine learning on multicore," in Proceedings of the 20th Annual Conference on Neural Information Processing Systems (NIPS '06), pp. 281-288, Vancouver, Canada, 2006.

[20] W. Zhao, H. Ma, and Q. He, "Parallel K-means clustering based on MapReduce," in Proceedings of the 1st International Conference on Cloud Computing (CloudCom '09), pp. 674-679, Beijing, China, December 2009.

[21] P. Zhou, J. Lei, and W. Ye, "Large-scale data sets clustering based on MapReduce and Hadoop," Journal of Computational Information Systems, vol. 7, no. 16, pp. 5956-5963, 2011.

[22] C. D. Nguyen, D. T. Nguyen, and V.-H. Pham, "Parallel two-phase K-means," in Proceedings of the 13th International Conference on Computational Science and Its Applications (ICCSA '13), pp. 24-27, Ho Chi Minh City, Vietnam, June 2013.

[23] R. J. Walinchus, "Real-time network decomposition and subnetwork interfacing," Tech. Rep. HS-011 999, Highway Research Record, 1971.

[24] S. C. Wong, W. T. Wong, C. M. Leung, and C. O. Tong, "Group-based optimization of a time-dependent TRANSYT traffic model for area traffic control," Transportation Research Part B: Methodological, vol. 36, no. 4, pp. 291-312, 2002.

[25] D. I. Robertson and R. D. Bretherton, "Optimizing networks of traffic signals in real time: the SCOOT method," IEEE Transactions on Vehicular Technology, vol. 40, no. 1, pp. 11-15, 1991.

[26] Y.-Y. Ma and X.-G. Yang, "Traffic sub-area division expert system for urban traffic control," in Proceedings of the International Conference on Intelligent Computation Technology and Automation (ICICTA '08), pp. 589-593, Hunan, China, October 2008.

[27] K. Lu, J.-M. Xu, S.-J. Zheng, and S.-M. Wang, "Research on fast dynamic division method of coordinated control subarea," Acta Automatica Sinica, vol. 38, no. 2, pp. 279-287, 2012.

[28] K. Lu, J.-M. Xu, and S.-J. Zheng, "Correlation degree analysis of neighboring intersections and its application," Journal of South China University of Technology, vol. 37, no. 11, pp. 37-42, 2009.

[29] H. Guo, J. Cheng, Q. Peng, C. Zhu, and Y. Mu, "Dynamic division of traffic control sub-area methods based on the similarity of adjacent intersections," in Proceedings of the IEEE 17th International Conference on Intelligent Transportation Systems (ITSC '14), pp. 2208-2213, Qingdao, China, October 2014.

[30] C. Li, Y. Xie, H. Zhang, and X.-L. Yan, "Dynamic division about traffic control sub-area based on back propagation neural network," in Proceedings of the 2nd International Conference on Intelligent Human-Machine Systems and Cybernetics (IHMSC '10), pp. 22-25, Nanjing, China, August 2010.

[31] Z. Zhou, S. Lin, and Y. Xi, "A fast network partition method for large-scale urban traffic networks," Journal of Control Theory and Applications, vol. 11, no. 3, pp. 359-366, 2013.

[32] J. MacQueen, "Some methods for classification and analysis of multivariate observations," in Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, pp. 281-297, University of California Press, Berkeley, Calif, USA, 1967.

[33] A. K. Jain, "Data clustering: 50 years beyond K-means," Pattern Recognition Letters, vol. 31, no. 8, pp. 651-666, 2010.

[34] X. Wu, V. Kumar, J. R. Quinlan et al., "Top 10 algorithms in data mining," Knowledge and Information Systems, vol. 14, no. 1, pp. 1-37, 2008.

[35] W. Zou, Y. Zhu, H. Chen, and X. Sui, "A clustering approach using cooperative artificial bee colony algorithm," Discrete Dynamics in Nature and Society, vol. 2010, Article ID 459796, 16 pages, 2010.

[36] D. T. Pham, S. S. Dimov, and C. D. Nguyen, "An incremental K-means algorithm," Proceedings of the Institution of Mechanical Engineers, Part C: Journal of Mechanical Engineering Science, vol. 218, no. 7, pp. 783-794, 2004.

[37] D. T. Pham, S. S. Dimov, and C. D. Nguyen, "A two-phase K-means algorithm for large datasets," Journal of Mechanical Engineering Science, vol. 218, no. 10, pp. 1269-1273, 2004.

[38] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, "The Hadoop distributed file system," in Proceedings of the 26th Symposium on Mass Storage Systems and Technologies (MSST '10), pp. 1-10, IEEE, Incline Village, Nev, USA, May 2010.

[39] A. Ene, S. Im, and B. Moseley, "Fast clustering using MapReduce," in Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '11), pp. 681-689, San Diego, Calif, USA, August 2011.

[40] T. White, Hadoop: The Definitive Guide, O'Reilly Media, Sebastopol, Calif, USA, 3rd edition, 2012.

[41] D. Xia, Z. Rong, Y. Zhou, Y. Li, Y. Shen, and Z. Zhang, "A novel parallel algorithm for frequent itemsets mining in massive small files datasets," ICIC Express Letters, Part B: Applications, vol. 5, no. 2, pp. 459-466, 2014.

[42] D. Xia, Z. Rong, Y. Zhou, B. Wang, Y. Li, and Z. Zhang, "Discovery and analysis of usage data based on Hadoop for personalized information access," in Proceedings of the IEEE 16th International Conference on Computational Science and Engineering—Big Data Science and Engineering (CSE-BDSE '13), pp. 917-924, IEEE, Sydney, Australia, December 2013.

[43] D. Xia, B. Wang, Z. Rong, Y. Li, and Z. Zhang, "Effective methods and strategies for massive small files processing based on Hadoop," ICIC Express Letters, vol. 8, no. 7, pp. 1935-1941, 2014.

[44] S. Ghemawat, H. Gobioff, and S.-T. Leung, "The Google file system," in Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP '03), pp. 29-43, Bolton Landing, NY, USA, October 2003.

[45] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Communications of the ACM, vol. 51, no. 1, pp. 107-113, 2008.

[46] P. Zikopoulos, C. Eaton, D. deRoos, T. Deutsch, and G. Lapis, Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data, McGraw-Hill, New York, NY, USA, 2011.

[47] W. K. D. Pun and A. B. M. S. Ali, "Unique distance measure approach for K-means (UDMA-Km) clustering algorithm," in Proceedings of the IEEE Region 10 Conference (TENCON '07), pp. 1-4, IEEE, Taipei, Taiwan, November 2007.

[48] A. M. Fahim, A. M. Salem, F. A. Torkey, and M. A. Ramadan, "An efficient enhanced K-means clustering algorithm," Journal of Zhejiang University SCIENCE A, vol. 7, no. 10, pp. 1626-1633, 2006.

[49] M. Zhu, Data Mining, University of Science and Technology of China Press, 2002.

[50] L. Kaufman and P. J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, John Wiley & Sons, 1990.

[51] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proceedings of the 26th Annual Conference on Neural Information Processing Systems (NIPS '12), pp. 1097-1105, Lake Tahoe, Nev, USA, December 2012.

[52] S. Englert, J. Gray, T. Kocher, and P. Shah, "A benchmark of NonStop SQL release 2 demonstrating near-linear speedup and scaleup on large databases," ACM SIGMETRICS Performance Evaluation Review, vol. 18, no. 1, pp. 245-246, 1990.

[53] X. Xu, J. Jäger, and H.-P. Kriegel, "A fast parallel clustering algorithm for large spatial databases," Data Mining and Knowledge Discovery, vol. 3, no. 3, pp. 263-290, 1999.


Figure 3: Logical data flow of a simple MapReduce job. [Diagram: input records → map → shuffle and sort by key → reduce → output key-value pairs.]

Figure 4: Parallel computing process of MapReduce. [Diagram: the user program forks a master and workers; the master assigns Map and Reduce tasks; Map workers read input splits and write intermediate results locally; Reduce workers remotely read the intermediate data and write the output files.]

Figure 5: Process and framework of the DTSAD method. [Flow: large-scale taxi trajectories → data preprocessing → parallel clustering → boundary identifying → traffic subarea, within a division framework built on Hadoop with MapReduce and ArcGIS.]

objects as initial centroids, where K is the number of desired clusters. Then each of the remaining objects is assigned

Figure 6: Process of the Par3PKM algorithm. [Flowchart: input → select K → initialize centroids → parallel execution of iteration (compute the distance between each object and each cluster center; assign each object to its closest centroid; recompute the centroid of each cluster) → output if the criterion function converges, otherwise iterate.]

to the cluster to which it is the most similar, on the basis of the distance between the object and the cluster center. Finally, it computes the new center for each cluster, and this

Figure 7: Parallel execution process of iteration. [Map function: compute the distance between each data object of the large-scale taxi trajectory data sets and each cluster center; select the centroid with the shortest distance; output the intermediate data. Combiner function: merge data objects belonging to the same cluster center; calculate the sum of the values of the data objects assigned to the same cluster; output the local results of each cluster. Reduce function: aggregate the local results of all the clusters; compute the new cluster center for each cluster; update the centroids; output the final results if the criterion function converges, otherwise iterate.]

process iterates until the criterion function converges. The parallel execution of iteration is illustrated in Figure 7, and its MapReduce implementation will be described in Section 4.3 in detail.

Typically, the squared-error criterion [33] is used to measure the quality of clustering. Let $X = \{x_i \mid i = 1, \ldots, n\}$ be the set of $n$ $m$-dimensional vectors to be clustered into a set of $K$ clusters $C = \{c_k \mid k = 1, \ldots, K\}$, and let $\mu_k$ be the mean of cluster $c_k$.

The squared-error criterion between $\mu_k$ and the given objects in cluster $c_k$ is defined as

$$J(c_k) = \sum_{x_i \in c_k} \left\| x_i - \mu_k \right\|^2. \tag{1}$$

The goal of the Par3PKM algorithm is to minimize the sum of the squared errors (SSE) over all the $K$ clusters, and SSE is given by the following equation:

$$\mathrm{SSE} = J(C) = \sum_{k=1}^{K} \sum_{x_i \in c_k} \left\| x_i - \mu_k \right\|^2. \tag{2}$$
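Equations (1)-(2) translate directly into code. The following is a minimal NumPy sketch (the function name `cluster_sse` and the toy data are illustrative, not from the paper):

```python
import numpy as np

def cluster_sse(points, labels, centroids):
    """Sum of squared errors over all K clusters, as in Eqs. (1)-(2):
    SSE = sum_k sum_{x_i in c_k} ||x_i - mu_k||^2."""
    total = 0.0
    for k, mu in enumerate(centroids):
        members = points[labels == k]          # objects assigned to cluster c_k
        total += np.sum((members - mu) ** 2)   # squared Euclidean distances
    return total

# Toy example: two 1-D clusters with means 1.0 and 10.0.
pts = np.array([[0.0], [2.0], [9.0], [11.0]])
lbl = np.array([0, 0, 1, 1])
mus = np.array([[1.0], [10.0]])
print(cluster_sse(pts, lbl, mus))  # 1 + 1 + 1 + 1 = 4.0
```

Minimizing this quantity over the assignment and the centroids is exactly the K-Means objective that Par3PKM parallelizes.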

4.2. Distance Measure and Cluster Initialization. It is well known that the K-Means algorithm is sensitive to the distance measure and to cluster initialization. To overcome these critical limitations, we optimize the distance metric and initialization strategy for improving the accuracy and efficiency of clustering in the proposed Par3PKM algorithm.

4.2.1. Distance Metric. The traditional K-Means algorithm typically employs the Euclidean metric to compute the distance between objects and cluster centers. In the proposed Par3PKM algorithm, we instead adopt two rules based on a statistical approach [47] for distance measure selection, which is more appropriate for large-scale trajectory data sets.

The two rules of distance measure selection are as follows:

(i) If $\kappa \le 8.46$, the square Euclidean distance (see (5)) is chosen as the distance measure of Par3PKM.

(ii) If $\kappa > 8.46$, the Manhattan distance (see (6)) is selected as the distance measure of Par3PKM.

Here, $\kappa$ is the kurtosis, which measures the tail heaviness. It is defined as

$$\kappa = \frac{(1/n)\sum_{i=1}^{n}\left(x_i - \bar{X}\right)^4}{\sigma^4}, \tag{3}$$

where $n$ represents the sequence length, and the sample mean $\bar{X}$ and sample standard deviation $\sigma$ are given by

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \sigma = \left(\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{X}\right)^2\right)^{1/2}. \tag{4}$$

The square Euclidean distance $d_e^2$ and the Manhattan distance $d_m$ are, respectively, given by

$$d_e^2\left(x_i, \mu_k\right) = \sum_{j=1}^{m} \left| x_{ij} - \mu_{kj} \right|^2, \tag{5}$$

$$d_m\left(x_i, \mu_k\right) = \sum_{j=1}^{m} \left| x_{ij} - \mu_{kj} \right|, \tag{6}$$


where $x_i = (x_{i1}, x_{i2}, \ldots, x_{im})$ and $\mu_k = (\mu_{k1}, \mu_{k2}, \ldots, \mu_{km})$ are two $m$-dimensional data objects.

The Par3PKM algorithm achieves more accurate clustering performance than the K-Means algorithm via this statistics-based distance measure method, which will be evaluated in Section 6.
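The metric-selection rule of Section 4.2.1 can be sketched as follows. The helper names `kurtosis` and `pick_distance` are mine, and the 8.46 threshold is taken from the rules above; Par3PKM's actual selection procedure follows [47]:

```python
import numpy as np

def kurtosis(x):
    """Kurtosis per Eq. (3), using the sample mean (1/n) and the
    sample standard deviation (1/(n-1)) of Eq. (4)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    mean = x.mean()
    sigma = np.sqrt(np.sum((x - mean) ** 2) / (n - 1))
    return np.sum((x - mean) ** 4) / n / sigma ** 4

def pick_distance(x, threshold=8.46):
    """Square Euclidean distance (Eq. 5) for light-tailed data,
    Manhattan distance (Eq. 6) for heavy-tailed data."""
    if kurtosis(x) <= threshold:
        return lambda a, b: np.sum((a - b) ** 2)   # Eq. (5)
    return lambda a, b: np.sum(np.abs(a - b))      # Eq. (6)
```

For instance, the sequence [1, 2, 3, 4, 5] has a kurtosis of about 1.09, so the square Euclidean distance would be used.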

4.2.2. Initialization Strategy. Different initialization strategies lead to different efficiencies. In our Par3PKM, three initialization strategies are developed to enhance the efficiency of clustering. To overcome the local minima of Par3PKM, for a given K with several different initial partitions, we select the partition with the smallest value of the squared error. Furthermore, inspired by [48], we take the mutually farthest data objects in the high-density area as the initial centroids for Par3PKM; the method of obtaining the initial centroids from the high-density area was introduced in [49]. In addition, both pre- and postprocessing for Par3PKM are taken into consideration; for example, we remove outliers in a preprocessing step, and we eliminate small clusters and/or merge close clusters into a large cluster when postprocessing the results.

From the experimental evaluations described in Section 6, we can observe that the implemented initialization strategies shorten the execution time of Par3PKM while yielding the same clustering results.
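An initialization of this flavor can be sketched as follows. This is an illustrative farthest-first variant under my own density estimate (mean distance to the m nearest neighbours); the paper's exact density method is the one in [49], and the helper name is hypothetical:

```python
import numpy as np

def dense_farthest_init(points, k, m=10):
    """Pick k initial centroids: restrict to the densest points (smallest
    mean distance to their m nearest neighbours), then greedily choose
    mutually farthest objects from that high-density candidate pool."""
    pts = np.asarray(points, dtype=float)
    d = np.linalg.norm(pts[:, None] - pts[None, :], axis=2)  # pairwise distances
    density = np.sort(d, axis=1)[:, 1:m + 1].mean(axis=1)    # small value = dense
    dense = np.argsort(density)[: max(k * 3, k)]             # candidate pool
    chosen = [dense[0]]                                      # start at the densest point
    while len(chosen) < k:                                   # farthest-first selection
        rest = [i for i in dense if i not in chosen]
        chosen.append(max(rest, key=lambda i: d[i, chosen].min()))
    return pts[chosen]
```

On two well-separated groups of points, the sketch returns one centroid inside each group, which is the behavior the strategy aims for.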

4.3. MapReduce Implementation. To improve the efficiency and scalability of clustering, we implement the Par3PKM algorithm in the MapReduce model of computation. The tasks of Par3PKM mainly consist of the following three aspects:

(i) Compute the distance between each object and each cluster center.

(ii) Assign each object to its closest centroid.

(iii) Recompute the new cluster center for each cluster.

In this work, we accomplish the aforementioned tasks on a Hadoop platform using MapReduce. First, we select K data objects (vectors) as the initial cluster centers from the input data sets, which are stored in HDFS. Then we update the centroid of each cluster after iterating. As depicted in Figure 7, the parallel execution of iteration is implemented by the Map function, the Combiner function, and the Reduce function in three phases (i.e., the Map phase, the Combine phase, and the Reduce phase), respectively.

4.3.1. Map Phase. In this phase, each Map task receives a line of the sequence file as a key-value pair, which forms the input to the Map function. The Map function first computes the distance between each data object and each cluster center, then assigns each object to its closest centroid according to the shortest distance, and finally outputs the intermediate data to the Combiner function. The Map function is formally depicted in Algorithm 1.

4.3.2. Combine Phase. In this phase, the Combiner function first extracts all the data objects from value1, which is the output of the Map function, and merges the data objects belonging to the same cluster center. Next, it calculates the sum of the values of the data objects assigned to the same cluster and records the number of samples in the cluster, so as to compute the mean value of the data objects. Finally, it outputs the local results of each cluster to the Reduce function. The Combiner function is formally described in Algorithm 2.

4.3.3. Reduce Phase. In this phase, the Reduce function extracts all the data objects from value2, which is the output of the Combiner function, and aggregates the local results of all the clusters. Then it computes the new cluster center for each cluster. After that, it judges whether the criterion function converges. Finally, it outputs the final results if this is true and executes the next iteration otherwise. The Reduce function is formally illustrated in Algorithm 3.
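The dataflow of the three phases can be sketched as a minimal single-process simulation. The function names and the in-memory generator pipeline are illustrative; Par3PKM itself runs these steps as Hadoop Map, Combiner, and Reduce tasks:

```python
import numpy as np
from collections import defaultdict

def map_phase(points, centroids):
    # Map: emit (index of the closest centroid, object) for every data object.
    for p in points:
        k = int(np.argmin([np.sum((p - c) ** 2) for c in centroids]))
        yield k, p

def combine_phase(mapped):
    # Combine: per cluster, keep only a local partial sum and a sample count;
    # this is what cuts the volume shuffled to the Reduce tasks.
    partial = defaultdict(lambda: [None, 0])
    for k, p in mapped:
        s, n = partial[k]
        partial[k] = [p if s is None else s + p, n + 1]
    return partial

def reduce_phase(partials):
    # Reduce: aggregate partial sums and emit new centroid = sum / count.
    return {k: s / n for k, (s, n) in partials.items()}

pts = np.array([[0.0], [2.0], [9.0], [11.0]])
cents = np.array([[0.0], [10.0]])
new = reduce_phase(combine_phase(map_phase(pts, cents)))
# After one iteration the centroids move to the cluster means, 1.0 and 10.0.
```

On a real cluster each Map task sees only its own input split and each Combiner only local Map output, but the arithmetic is the same.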

4.4. Complexity Analysis. MapReduce is a programming model for large-scale data processing, and MapReduce programs in particular are inherently parallel. Thus, the Par3PKM algorithm with its MapReduce implementation distributes large numbers of computational tasks across different nodes. Under this parallel processing paradigm, the time complexity of Par3PKM is $O(K \cdot I \cdot n \cdot m)/(p \cdot q)$, where $p$ is the number of nodes, $q$ is the number of Map tasks in one node, and $K$, $I$, $n$, and $m$ are explained in Section 3. Moreover, the space complexity of Par3PKM is $O((K + n) \cdot m)/p$. Only the data objects and the centroids are stored in each slave node, respectively; thus, the space requirements for Par3PKM are modest.

In comparison with the traditional K-Means algorithm, the Par3PKM algorithm improves the efficiency of clustering according to the following rate equation:

$$O_{\mathrm{imp}} = \left(1 - \frac{1}{p \cdot q}\right) \times 100\%. \tag{7}$$

Based on the above analyses, we can conclude that the Par3PKM algorithm is relatively scalable and efficient in clustering large-scale data sets, with the desired computational complexity.

5. Case Study

In this section, we apply the proposed approach to divide the traffic subareas of Beijing using large-scale taxi trajectories and then analyze the results.

5.1. Data Sets and Framework. In this work, we divide the traffic subareas of Beijing based on real-world trajectory data sets (http://www.datatang.com/), which contain a large number of GPS trajectories recorded by 12,000 taxicabs during a period of 30 days in November 2012. The total distance of the data sets is more than 50 million kilometers, and the total size is 50 GB; in particular, the total number of GPS points reaches 969 million.

Additionally, we perform the case study on a Hadoop cluster with MapReduce (as described in Section 6.1) and on ArcGIS using the road network of Beijing, which consists of 106,579 road nodes and 141,380 road segments.


Input:
  key: the offset
  value: the sample
  centroids: the global variable
Output: ⟨key1, value1⟩, where
  key1: the index of the closest centroid
  value1: the information of the sample

(1) Construct a global variable centroids including the information of the closest centroid point
(2) Construct the sample examples to extract the data objects from value
(3) min_Dis = Double.MAX_VALUE
(4) index = -1
(5) for i = 0 to centroids.length do
(6)   distance = DistanceFunction(examples, centroids[i])
(7)   if distance < min_Dis then
(8)     min_Dis = distance
(9)     index = i
(10)  end if
(11) end for
(12) key1 = index
(13) Construct value1 as a string consisting of the values of the different dimensions
(14) return ⟨key1, value1⟩ pairs

Algorithm 1: Map(key, value).

Input:
  key1: the index of the cluster
  med_i: the list of the samples assigned to the same cluster
Output: ⟨key2, value2⟩, where
  key2: the index of the cluster
  value2: the sum of the values of the samples belonging to the same cluster and the number of those samples

(1) Construct a counter num_s to record the number of samples in the same cluster
(2) Construct an array sum_v to record the sum of the values of the different dimensions of the samples belonging to the same cluster (i.e., the samples in the list med_i)
(3) Construct the sample examples to extract the data objects from med_i.next() and the dimensions to obtain the dimension of the original data object
(4) num_s = 0
(5) while (med_i.hasNext()) do
(6)   CurrentPoint = med_i.next()
(7)   num_s++
(8)   for i = 0 to dimensions do
(9)     sum_v[i] += CurrentPoint.point[i]
(10)    Calculate the sum of the values of each dimension of examples
(11)  end for
(12)  for i = 0 to dimensions do
(13)    mean[i] = sum_v[i]/num_s
(14)    Compute the mean value of the samples for each cluster
(15)  end for
(16) end while
(17) key2 = key1
(18) Construct value2 as a string containing the sum of the values of each dimension sum_v[i] and the number of samples num_s
(19) return ⟨key2, value2⟩ pairs

Algorithm 2: Combiner(key1, med_i).


Input:
  key2: the index of the cluster
  med_i: the list of the local sums from different clusters
Output: ⟨key3, value3⟩, where
  key3: the index of the cluster
  value3: the new cluster center

(1) Construct a counter Num to record the total number of samples belonging to the same cluster
(2) Construct an array sum_v to record the sum of the values of the different dimensions of the samples in the same cluster (i.e., the samples in the list med_i)
(3) Construct the sample examples to extract the data objects from med_i.next() and the dimensions to obtain the dimension of the original data object
(4) Num = 0
(5) while (med_i.hasNext()) do
(6)   CurrentPoint = med_i.next()
(7)   Num += num_s
(8)   for i = 0 to dimensions do
(9)     sum_v[i] += CurrentPoint.point[i]
(10)  end for
(11)  for i = 0 to dimensions do
(12)    mean[i] = sum_v[i]/Num
(13)    Obtain the new cluster center
(14)  end for
(15) end while
(16) key3 = key2
(17) Construct value3 as a string composed of the new cluster center
(18) return ⟨key3, value3⟩ pairs

Algorithm 3: Reduce(key2, med_i).

Figure 8: Clustering results of large-scale taxi trajectories. [Two map panels, each highlighting Area A.]

5.2. Parallel Clustering. After data preprocessing with the DTSAD method, we extract the relevant attributes (e.g., longitude and latitude) of the GPS trajectory records where passengers pick up or drop off taxis from the aforementioned data sets. Then, on a Hadoop cluster with MapReduce, we cluster the extracted trajectory data sets through the Par3PKM algorithm (as depicted in Section 4) with K = 100, which is the number of desired clusters. Finally, based on the ArcGIS platform with the road network of Beijing, we plot the results in Figure 8.

As illustrated in Figure 8, large-scale taxi trajectories are clustered into different areas, and each area with

Figure 9: Process of the boundary identifying method: (a) division of the coordinate system, (b) selection of border points, and (c) connection of border points.

a different color represents a cluster. Each cluster (e.g., Area A) has obvious characteristics of traffic conditions, such as a high flow of people and automobiles, in comparison with the real traffic map and traffic conditions of Beijing.

5.3. Boundary Identifying. On the ArcGIS platform, it is difficult to identify the borders of each cluster directly. However, we have to connect the borders of each cluster in order to accurately form traffic subareas, which we do via our boundary identifying method, described in Figure 9. As illustrated in Figure 9, the boundary identifying method mainly consists of the following three steps.

Step 1. As shown in Figure 9(a), we build a coordinate system that is equally divided into n parts, taking (0, 0) as the origin of coordinates.

Step 2. We match each cluster center to the origin of coordinates and then map the other points in the same cluster to the coordinate system. Finally, the farthest point of each part is selected (e.g., P in Figure 9(b)).

Step 3. As depicted in Figure 9(c), we connect these selected points of each part and then obtain a subarea.
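The three steps can be sketched as follows. This is a minimal sketch under my own naming and parameter choices (the function name, the default number of sectors, and the use of NumPy are assumptions, not the paper's implementation):

```python
import numpy as np

def subarea_boundary(points, center, n=12):
    """Boundary identifying, Steps 1-3: translate the cluster so its centre
    is the origin, split the plane into n equal angular sectors, keep the
    farthest member of each sector, and return those border points in
    angular order (connecting consecutive points yields the subarea)."""
    rel = np.asarray(points) - np.asarray(center)
    ang = np.arctan2(rel[:, 1], rel[:, 0]) % (2 * np.pi)   # angle of each point
    sector = (ang // (2 * np.pi / n)).astype(int)          # Step 1: n parts
    dist = np.hypot(rel[:, 0], rel[:, 1])
    border = {}
    for i, s in enumerate(sector):                         # Step 2: farthest point
        if s not in border or dist[i] > dist[border[s]]:
            border[s] = i
    idx = [border[s] for s in sorted(border)]              # Step 3: angular order
    return np.asarray(points)[idx]
```

Connecting the returned points in order (and closing back to the first) approximates the cluster border; a larger n gives a smoother polygon at the cost of more sensitivity to outlying points.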

Using the presented boundary identifying method with the clustering results of Par3PKM, we plot the division results of the traffic subareas in Figure 10. As described in Figure 10, Area A is a typical traffic subarea, shown in the lower right corner of the graph, which includes Tsinghua University, Peking University, Renmin University of China, Beihang University, and Zhongguancun. Area B is composed of the Beijing Workers' Gymnasium, Blue Zoo Beijing, Children Outdoor Paopao Paradise, and so forth. Area C consists of Beijing North Railway Station, Beijing Exhibition Center, Capital Indoor Stadium, and so forth. Area D contains the Olympic Sports Center Stadium, Beijing Olympic Park, the National Indoor Stadium, the Beijing National Aquatics Center, the Beijing International Convention Center, and so forth.

5.4. Analysis of Results. According to the division results, we can observe that some areas with similar traffic conditions are divided into the same traffic subarea, such as Tsinghua University and Peking University, or the Olympic Sports Center Stadium and the National Indoor Stadium. In contrast, Blue Zoo Beijing and the Beijing Exhibition Center, as well as Beijing North Railway Station and the Beijing International Convention Center, are classified into different traffic subareas. This is because the different regions within a traffic subarea have great similarities and correlations in traffic condition, business pattern, and other aspects.

Based on the above analysis, we conclude that the proposed Par3PKM algorithm can efficiently cluster big trajectory data on a Hadoop cluster using MapReduce. Moreover, our boundary identifying method can accurately connect the borders of the clustering results for each cluster. In particular, the division results are consistent with the real traffic conditions of the corresponding areas in Beijing. Overall, the results demonstrate that Par3PKM combined with DTSAD is a promising alternative for traffic subarea division with large-scale taxi trajectories and thus can reduce the complexity of traffic planning, management, and analysis. More importantly, it can provide helpful decision-making for building ITSs.

6. Evaluation and Discussion

In this section, we evaluate the accuracy, efficiency, speedup, scale-up, and reliability of the Par3PKM algorithm via extensive experiments on real and synthetic data and then discuss the performance results.

6.1. Evaluation Setup. The experimental platform is based on a Hadoop cluster, which is composed of one master machine and eight slave machines, each with an Intel Xeon E7-4820 2.00 GHz CPU (4-core) and 8.00 GB RAM. All the experiments are performed on Ubuntu 12.04 OS with Hadoop 1.0.4 and JDK 1.6.0.

Discrete Dynamics in Nature and Society 11

Figure 10: Division results of traffic subarea (Areas A, B, C, and D).

Table 1: Data sets of the experimental evaluations.

Name                  Number of instances   Number of attributes
Iris                  150                   4
Haberman's Survival   306                   3
Ecoli                 336                   8
Hayes-Roth            160                   5
Lenses                24                    4
Wine                  178                   13

In addition to a real taxi trajectory data set (as described in Section 5.1), we use six synthetic data sets (as shown in Table 1) selected from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets.html) to evaluate the performance of the Par3PKM algorithm in comparison with K-Means, Par2PK-Means, and ParCLARA, a MapReduce implementation of the best-known K-medoids algorithm CLARA (Clustering Large Applications) [50]. Meanwhile, each data set is processed into 80 MB, 160 MB, 320 MB, 640 MB, 1280 MB, and 2560 MB versions so as to further verify the efficiency of the proposed algorithm. Also, we process the seven data sets into 160 MB, 320 MB, 480 MB, 640 MB, 800 MB, 960 MB, 1120 MB, and 1280 MB versions for validating the scale-up of our algorithm.

6.2. Evaluation on Efficiency. We perform efficiency experiments where we execute Par3PKM, Par2PK-Means, and ParCLARA in the parallel environment with eight nodes, and K-Means in the single machine environment, using seven data sets with different sizes (varying from 80 MB to 2560 MB), respectively, in order to demonstrate whether the Par3PKM algorithm can process larger data sets and has higher efficiency. The experimental results are respectively shown in Table 2 and Figure 11.

As depicted in Figure 11, the K-Means algorithm cannot process data sets over 1280 MB in the single machine environment on account of memory overflow, and thus the graph does not present the corresponding execution time of K-Means in this experiment. However, the Par3PKM algorithm can effectively handle data sets of more than 1280 MB, and even larger data, in the parallel environment (i.e., in a "Big Data" environment). In particular, the Par3PKM algorithm has higher efficiency than the ParCLARA algorithm and the Par2PK-Means algorithm, owing to its improvement and parallelization of the K-Means algorithm, such as the addition of the Combiner function in the Combine phase. To reduce the computational complexity of the MapReduce job and save the limited bandwidth available on a Hadoop cluster, the Combiner function of the Par3PKM algorithm is employed to cut down the amount of data shuffled between the Map tasks and the Reduce tasks; it is specified to run on the Map output, and its output forms the input to the Reduce function.

At the same time, we can find that the execution time of the K-Means algorithm is shorter than that of the Par3PKM algorithm when clustering small-scale data sets. The reason is that the communication and interaction of each node consume a certain amount of time in the parallel environment (e.g., the start of Job and Task processes, and the communication between the NameNode and the DataNodes), thereby making the execution time much longer than the actual computation time of the Par3PKM algorithm. More importantly, we can also observe that the efficiency of the Par3PKM algorithm improves many times over, and its superiority becomes more marked, as the size of the data sets gradually increases.


Table 2: Execution time comparison on seven data sets.

Data sets             Size (MB)   Execution time (s)
                                  K-Means   ParCLARA   Par2PK-Means   Par3PKM
Taxi Trajectory       80          75        662        439            312
                      160         103       756        486            371
                      320         195       893        579            406
                      640         1125      1173       838            614
                      1280        2373      1679       1348           1026
                      2560        —         2721       2312           1879
Iris                  80          32        612        301            213
                      160         54        693        380            279
                      320         116       812        440            306
                      640         685       1045       630            403
                      1280        1800      1248       1005           768
                      2560        —         2463       2013           1420
Haberman's Survival   80          56        576        311            296
                      160         60        675        400            324
                      320         130       823        470            378
                      640         720       987        670            426
                      1280        2010      1321       1200           873
                      2560        —         2449       2200           1719
Ecoli                 80          38        628        330            283
                      160         66        712        400            324
                      320         130       835        460            375
                      640         756       1104       700            482
                      1280        1912      1636       1234           763
                      2560        —         2479       2312           1416
Hayes-Roth            80          41        568        310            278
                      160         57        643        395            347
                      320         125       726        460            387
                      640         715       973        660            438
                      1280        1980      1479       1211           736
                      2560        —         2423       2120           1567
Lenses                80          32        624        309            297
                      160         59        701        389            327
                      320         130       924        452            376
                      640         700       1072       545            432
                      1280        1895      1378       1085           814
                      2560        —         2379       2089           1473
Wine                  80          50        635        350            317
                      160         78        705        420            356
                      320         130       835        470            402
                      640         730       1006       610            426
                      1280        2100      1346       1245           843
                      2560        —         2463       2240           1645

6.3. Evaluation on Accuracy. To evaluate the accuracy, we cluster different data sets via Par3PKM, Par2PK-Means, and ParCLARA based on a Hadoop cluster with eight nodes, and through K-Means in the single machine environment, respectively. We then plot the results in Figure 12(a).

The quality of the algorithm is evaluated via the following error rate (ER) [51] equation:

\[
\mathrm{ER} = \frac{O_m}{O_t} \times 100\%, \tag{8}
\]

Figure 11: Efficiency comparison on different data sets: (a) Taxi Trajectory, (b) Iris, (c) Haberman's Survival, (d) Ecoli, (e) Hayes-Roth, (f) Lenses, and (g) Wine. Each panel plots execution time (s) against data set size (80–2560 MB) for K-Means, ParCLARA, Par2PK-Means, and Par3PKM.

Figure 12: Accuracy and reliability: (a) accuracy comparison (error rate, %) of Par3PKM, Par2PK-Means, and ParCLARA on different data sets and (b) reliability of Par3PKM (execution time against the number of faulty nodes, 0–7).

where $O_m$ is the number of misclassified objects and $O_t$ is the total number of objects. The lower the ER, the better the clustering.
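For instance, the ER can be computed as below. This is our own illustrative sketch, and it assumes that the cluster labels have already been matched to the ground-truth classes (the paper does not detail this mapping step):

```python
def error_rate(predicted, truth):
    """Error rate (ER): misclassified objects O_m over total objects O_t,
    expressed as a percentage. Assumes predicted cluster labels are
    already aligned with the ground-truth class labels."""
    misclassified = sum(1 for p, t in zip(predicted, truth) if p != t)
    return 100.0 * misclassified / len(truth)

# One of six objects is misclassified, so ER is 100/6 percent.
er = error_rate([0, 0, 1, 1, 2, 2], [0, 0, 1, 2, 2, 2])
```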

As illustrated in Figure 12(a), in comparison with the other algorithms, the Par3PKM algorithm produces more accurate clustering results in most cases. The results indicate that the Par3PKM algorithm is valid and feasible.

6.4. Evaluation on Speedup. In order to evaluate the speedup of the Par3PKM algorithm, we keep the seven data sets constant (1280 MB) and increase the number of nodes (ranging from 1 node to 8 nodes) on a Hadoop cluster, and then plot the results in Figure 13(a). Moreover, we utilize the Iris data set (1280 MB) to further verify the speedup of Par3PKM in comparison with Par2PK-Means and ParCLARA, and the results are illustrated in Figure 13(b).

The speedup metric [52, 53] is defined as

\[
\text{Speedup} = \frac{T_s}{T_p}, \tag{9}
\]

where $T_s$ represents the execution time of an algorithm on one node (i.e., the sequential execution time) for clustering objects using the given data sets, and $T_p$ denotes the execution time of an algorithm for solving the same problem using the same data sets on a Hadoop cluster with $p$ nodes (i.e., the parallel execution time), respectively.

Figure 13: Speedup: (a) speedup comparison of Par3PKM on different data sets and (b) speedup comparison of Par3PKM, Par2PK-Means, and ParCLARA, both plotted against the number of nodes (1–8) with linear speedup shown for reference.

As depicted in Figure 13(a), the speedup of the Par3PKM algorithm increases relatively linearly with an increasing number of nodes. It is known that linear speedup is difficult to achieve because of the communication cost and the skew of the slaves. Furthermore, Figure 13(b) shows that Par3PKM has better speedup than Par2PK-Means and ParCLARA. The results demonstrate that the parallel Par3PKM algorithm has very good speedup performance, which is almost the same for data sets with very different sizes.

6.5. Evaluation on Scale-Up. To evaluate how well the Par3PKM algorithm processes larger data sets when more nodes are available, we perform scale-up experiments where we increase the size of the data sets (varying from 160 MB to 1280 MB) in direct proportion to the number of nodes (ranging from 1 node to 8 nodes) and then plot the results in Figure 14(a). Furthermore, with the Iris data sets (varying from 160 MB to 1280 MB), the scale-up comparison of Par3PKM, Par2PK-Means, and ParCLARA is depicted in Figure 14(b).

The scale-up metric [52, 53] is given by

\[
\text{Scale-up} = \frac{T_s}{T_p}, \tag{10}
\]

where $T_s$ is the execution time of an algorithm for processing the given data sets on one node, and $T_p$ is the execution time of an algorithm for handling $p$-times larger data sets on $p$-times more nodes.

As illustrated in Figure 14(a), the scale-up values of Par3PKM are in the vicinity of 1, or even less, with the proportional growth of both the number of nodes and the size of the data sets. Moreover, Figure 14(b) shows that Par3PKM has better scale-up than Par2PK-Means and ParCLARA. The results indicate that the Par3PKM algorithm has excellent scale-up and adaptability on large-scale data sets based on a Hadoop cluster with MapReduce.
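Both metrics in (9) and (10) are simple ratios of measured execution times; a minimal sketch with hypothetical timings (our own numbers, not measurements from the paper) is:

```python
def speedup(t_sequential, t_parallel):
    # Speedup = T_s / T_p; the ideal (linear) value equals the node count p.
    return t_sequential / t_parallel

def scale_up(t_one_node, t_p_nodes):
    # Scale-up = T_s / T_p, where T_p handles p-times larger data on
    # p-times more nodes; values near 1 mean the algorithm scales well.
    return t_one_node / t_p_nodes

# Hypothetical timings in seconds for illustration only:
s = speedup(1200.0, 200.0)    # e.g., 1 node vs. 6 nodes on fixed-size data
c = scale_up(160.0, 170.0)    # e.g., base case vs. p-times data on p nodes
```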

6.6. Evaluation on Reliability. To evaluate the reliability of the Par3PKM algorithm, we shut down several nodes (ranging from 1 node to 7 nodes) to demonstrate whether Par3PKM can normally execute and achieve the same clustering results on the Iris data set with 1280 MB, and then plot the results in Figure 12(b).

As illustrated in Figure 12(b), although the execution time of the Par3PKM algorithm increases gradually with the growth of the number of faulty nodes, the Par3PKM algorithm still executes normally and produces the same results. The results show that the Par3PKM algorithm has good reliability in a "Big Data" environment, owing to the high fault tolerance of the MapReduce framework on a Hadoop platform: when a node cannot execute its tasks on a Hadoop cluster, the JobTracker automatically reassigns the tasks of the faulty node(s) to other spare nodes. Conversely, the serial K-Means algorithm on a single machine cannot continue executing when the machine is faulty, and the entire computational task fails.

In summary, extensive experiments are conducted on real and synthetic data, and the performance results demonstrate that the proposed Par3PKM algorithm is much more efficient and accurate, with better speedup, scale-up, and reliability.

Figure 14: Scale-up: (a) scale-up comparison of Par3PKM on different data sets and (b) scale-up comparison of Par3PKM, Par2PK-Means, and ParCLARA, both plotted against the number of nodes (1–8).

7. Conclusions

In this paper, we have proposed an efficient MapReduce-based parallel clustering algorithm, named Par3PKM, to solve the traffic subarea division problem with large-scale taxi trajectories. In Par3PKM, the distance metric and initialization strategy of K-Means are optimized in order to enhance the accuracy of clustering. Then, to improve the efficiency and scalability of Par3PKM, the optimized K-Means algorithm is implemented in a MapReduce framework on Hadoop. The optimization and parallelism of Par3PKM reduce memory consumption and the computational cost of large calculations, thereby significantly improving the accuracy, efficiency, scalability, and reliability of traffic subarea division. Our performance evaluation indicates that the proposed algorithm can efficiently cluster a large number of GPS trajectories of taxicabs and, in particular, achieves more accurate results than K-Means, Par2PK-Means, and ParCLARA, with favorably superior performance. Furthermore, based on Par3PKM, we have presented a distributed traffic subarea division method, named DTSAD, which is performed on a Hadoop distributed computing platform with the MapReduce parallel processing paradigm. In DTSAD, the boundary identifying method can effectively connect the borders of the clustering results. Most importantly, we have divided the traffic subareas of Beijing using big real-world taxi trajectory data sets through the presented method, and our case study demonstrates that our approach can accurately and efficiently divide traffic subareas.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Authors' Contribution

Dawen Xia and Binfeng Wang contributed equally to this work.

Acknowledgments

The authors would like to thank the academic editor and the anonymous reviewers for their valuable comments and suggestions. This work was partially supported by the National Natural Science Foundation of China (Grant no. 61402380), the Scientific Project of the State Ethnic Affairs Commission of the People's Republic of China (Grant no. 14GZZ012), the Science and Technology Foundation of Guizhou (Grant no. LH20147386), and the Fundamental Research Funds for the Central Universities (Grants nos. XDJK2015B030 and XDJK2015D029).

References

[1] Y. Qi and S. Ishak, "Stochastic approach for short-term freeway traffic prediction during peak periods," IEEE Transactions on Intelligent Transportation Systems, vol. 14, no. 2, pp. 660–672, 2013.
[2] A. de Palma and R. Lindsey, "Traffic congestion pricing methodologies and technologies," Transportation Research Part C: Emerging Technologies, vol. 19, no. 6, pp. 1377–1399, 2011.
[3] V. Marx, "The big challenges of big data," Nature, vol. 498, no. 7453, pp. 255–260, 2013.
[4] D. Agrawal, P. Bernstein, E. Bertino et al., "Challenges and opportunities with big data: a community white paper developed by leading researchers across the United States," White Paper, 2012.
[5] "Special online collection: dealing with data," Science, vol. 331, no. 6018, pp. 639–806, 2011.
[6] "Big data: science in the petabyte era," Nature, vol. 455, no. 7209, pp. 1–136, 2008.
[7] R. R. Weiss and L. Zgorski, Obama Administration Unveils 'Big Data' Initiative: Announces $200 Million in New R&D Investments, Office of Science and Technology Policy, Executive Office of the President, 2012.
[8] J. Manyika, M. Chui, B. Brown et al., "Big data: the next frontier for innovation, competition, and productivity," Tech. Rep., McKinsey Global Institute, 2011.
[9] Y. Genovese and S. Prentice, "Pattern-based strategy: getting value from big data," Gartner Special Report, 2011.
[10] N. J. Yuan, Y. Zheng, L. Zhang, and X. Xie, "T-finder: a recommender system for finding passengers and vacant taxis," IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 10, pp. 2390–2403, 2013.
[11] J. Yuan, Y. Zheng, C. Zhang et al., "T-drive: driving directions based on taxi trajectories," in Proceedings of the 18th International Conference on Advances in Geographic Information Systems (ACM SIGSPATIAL GIS '10), pp. 99–108, ACM, San Jose, Calif, USA, November 2010.
[12] J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, Waltham, Mass, USA, 3rd edition, 2011.
[13] D. Pelleg and A. Moore, "X-means: extending K-means with efficient estimation of the number of clusters," in Proceedings of the 17th International Conference on Machine Learning (ICML '00), pp. 727–734, Stanford, Calif, USA, June 2000.
[14] S. Kantabutra and A. L. Couch, "Parallel K-means clustering algorithm on NOWs," NECTEC Technical Journal, vol. 1, no. 6, pp. 243–247, 2000.
[15] Y. Zhang, Z. Xiong, J. Mao, and L. Ou, "The study of parallel K-means algorithm," in Proceedings of the 6th World Congress on Intelligent Control and Automation (WCICA '06), pp. 5868–5871, IEEE, Dalian, China, June 2006.
[16] P. Kraj, A. Sharma, N. Garge, R. Podolsky, and R. A. McIndoe, "ParaKMeans: implementation of a parallelized K-means algorithm suitable for general laboratory use," BMC Bioinformatics, vol. 9, article 200, 13 pages, 2008.
[17] M. K. Pakhira, "Clustering large databases in distributed environment," in Proceedings of the IEEE International Advance Computing Conference (IACC '09), pp. 351–358, Patiala, India, March 2009.
[18] K. J. Kohlhoff, V. S. Pande, and R. B. Altman, "K-means for parallel architectures using all-prefix-sum sorting and updating steps," IEEE Transactions on Parallel and Distributed Systems, vol. 24, no. 8, pp. 1602–1612, 2013.
[19] C.-T. Chu, S. K. Kim, Y.-A. Lin et al., "Map-reduce for machine learning on multicore," in Proceedings of the 20th Annual Conference on Neural Information Processing Systems (NIPS '06), pp. 281–288, Vancouver, Canada, 2006.
[20] W. Zhao, H. Ma, and Q. He, "Parallel K-means clustering based on MapReduce," in Proceedings of the 1st International Conference on Cloud Computing (CloudCom '09), pp. 674–679, Beijing, China, December 2009.
[21] P. Zhou, J. Lei, and W. Ye, "Large-scale data sets clustering based on MapReduce and Hadoop," Journal of Computational Information Systems, vol. 7, no. 16, pp. 5956–5963, 2011.
[22] C. D. Nguyen, D. T. Nguyen, and V.-H. Pham, "Parallel two-phase K-means," in Proceedings of the 13th International Conference on Computational Science and Its Applications (ICCSA '13), pp. 24–27, Ho Chi Minh City, Vietnam, June 2013.
[23] R. J. Walinchus, "Real-time network decomposition and subnetwork interfacing," Tech. Rep. HS-011 999, Highway Research Record, 1971.
[24] S. C. Wong, W. T. Wong, C. M. Leung, and C. O. Tong, "Group-based optimization of a time-dependent TRANSYT traffic model for area traffic control," Transportation Research Part B: Methodological, vol. 36, no. 4, pp. 291–312, 2002.
[25] D. I. Robertson and R. D. Bretherton, "Optimizing networks of traffic signals in real time: the SCOOT method," IEEE Transactions on Vehicular Technology, vol. 40, no. 1, pp. 11–15, 1991.
[26] Y.-Y. Ma and X.-G. Yang, "Traffic sub-area division expert system for urban traffic control," in Proceedings of the International Conference on Intelligent Computation Technology and Automation (ICICTA '08), pp. 589–593, Hunan, China, October 2008.
[27] K. Lu, J.-M. Xu, S.-J. Zheng, and S.-M. Wang, "Research on fast dynamic division method of coordinated control subarea," Acta Automatica Sinica, vol. 38, no. 2, pp. 279–287, 2012.
[28] K. Lu, J.-M. Xu, and S.-J. Zheng, "Correlation degree analysis of neighboring intersections and its application," Journal of South China University of Technology, vol. 37, no. 11, pp. 37–42, 2009.
[29] H. Guo, J. Cheng, Q. Peng, C. Zhu, and Y. Mu, "Dynamic division of traffic control sub-area methods based on the similarity of adjacent intersections," in Proceedings of the IEEE 17th International Conference on Intelligent Transportation Systems (ITSC '14), pp. 2208–2213, Qingdao, China, October 2014.
[30] C. Li, Y. Xie, H. Zhang, and X.-L. Yan, "Dynamic division about traffic control sub-area based on back propagation neural network," in Proceedings of the 2nd International Conference on Intelligent Human-Machine Systems and Cybernetics (IHMSC '10), pp. 22–25, Nanjing, China, August 2010.
[31] Z. Zhou, S. Lin, and Y. Xi, "A fast network partition method for large-scale urban traffic networks," Journal of Control Theory and Applications, vol. 11, no. 3, pp. 359–366, 2013.
[32] J. MacQueen, "Some methods for classification and analysis of multivariate observations," in Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, pp. 281–297, University of California Press, Berkeley, Calif, USA, 1967.
[33] A. K. Jain, "Data clustering: 50 years beyond K-means," Pattern Recognition Letters, vol. 31, no. 8, pp. 651–666, 2010.
[34] X. Wu, V. Kumar, J. R. Quinlan et al., "Top 10 algorithms in data mining," Knowledge and Information Systems, vol. 14, no. 1, pp. 1–37, 2008.
[35] W. Zou, Y. Zhu, H. Chen, and X. Sui, "A clustering approach using cooperative artificial bee colony algorithm," Discrete Dynamics in Nature and Society, vol. 2010, Article ID 459796, 16 pages, 2010.
[36] D. T. Pham, S. S. Dimov, and C. D. Nguyen, "An incremental K-means algorithm," Proceedings of the Institution of Mechanical Engineers, Part C: Journal of Mechanical Engineering Science, vol. 218, no. 7, pp. 783–794, 2004.
[37] D. T. Pham, S. S. Dimov, and C. D. Nguyen, "A two-phase K-means algorithm for large datasets," Journal of Mechanical Engineering Science, vol. 218, no. 10, pp. 1269–1273, 2004.
[38] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, "The Hadoop distributed file system," in Proceedings of the 26th Symposium on Mass Storage Systems and Technologies (MSST '10), pp. 1–10, IEEE, Incline Village, Nev, USA, May 2010.
[39] A. Ene, S. Im, and B. Moseley, "Fast clustering using MapReduce," in Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '11), pp. 681–689, San Diego, Calif, USA, August 2011.
[40] T. White, Hadoop: The Definitive Guide, O'Reilly Media, Sebastopol, Calif, USA, 3rd edition, 2012.
[41] D. Xia, Z. Rong, Y. Zhou, Y. Li, Y. Shen, and Z. Zhang, "A novel parallel algorithm for frequent itemsets mining in massive small files datasets," ICIC Express Letters, Part B: Applications, vol. 5, no. 2, pp. 459–466, 2014.
[42] D. Xia, Z. Rong, Y. Zhou, B. Wang, Y. Li, and Z. Zhang, "Discovery and analysis of usage data based on Hadoop for personalized information access," in Proceedings of the IEEE 16th International Conference on Computational Science and Engineering—Big Data Science and Engineering (CSE-BDSE '13), pp. 917–924, IEEE, Sydney, Australia, December 2013.
[43] D. Xia, B. Wang, Z. Rong, Y. Li, and Z. Zhang, "Effective methods and strategies for massive small files processing based on Hadoop," ICIC Express Letters, vol. 8, no. 7, pp. 1935–1941, 2014.
[44] S. Ghemawat, H. Gobioff, and S.-T. Leung, "The Google file system," in Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP '03), pp. 29–43, Bolton Landing, NY, USA, October 2003.
[45] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.
[46] P. Zikopoulos, C. Eaton, D. deRoos, T. Deutsch, and G. Lapis, Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data, McGraw-Hill, New York, NY, USA, 2011.
[47] W. K. D. Pun and A. B. M. S. Ali, "Unique distance measure approach for K-means (UDMA-Km) clustering algorithm," in Proceedings of the IEEE Region 10 Conference (TENCON '07), pp. 1–4, IEEE, Taipei, Taiwan, November 2007.
[48] A. M. Fahim, A. M. Salem, F. A. Torkey, and M. A. Ramadan, "An efficient enhanced K-means clustering algorithm," Journal of Zhejiang University SCIENCE A, vol. 7, no. 10, pp. 1626–1633, 2006.
[49] M. Zhu, Data Mining, University of Science and Technology of China Press, 2002.
[50] L. Kaufman and P. J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, John Wiley & Sons, 1990.
[51] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proceedings of the 26th Annual Conference on Neural Information Processing Systems (NIPS '12), pp. 1097–1105, Lake Tahoe, Nev, USA, December 2012.
[52] S. Englert, J. Gray, T. Kocher, and P. Shah, "A benchmark of NonStop SQL Release 2 demonstrating near-linear speedup and scaleup on large databases," ACM SIGMETRICS Performance Evaluation Review, vol. 18, no. 1, pp. 245–246, 1990.
[53] X. Xu, J. Jäger, and H.-P. Kriegel, "A fast parallel clustering algorithm for large spatial databases," Data Mining and Knowledge Discovery, vol. 3, no. 3, pp. 263–290, 1999.


Figure 7: Parallel execution process of an iteration. For large-scale taxi trajectory data sets, the Map function computes the distance between each data object and each cluster center, selects the centroid with the shortest distance, and outputs the intermediate data; the Combiner function merges data objects belonging to the same cluster center, calculates the sum of the values of data objects assigned to the same cluster, and outputs the local results of each cluster; the Reduce function aggregates the local results of all the clusters, computes the new cluster center for each cluster, and updates the centroids. If the criterion function converges, the final results are output; otherwise, the next iteration begins.

process iterates until the criterion function converges. The parallel execution of an iteration is illustrated in Figure 7, and its MapReduce implementation will be described in Section 4.3 in detail.

Typically, the squared-error criterion [33] is used to measure the quality of clustering. Let $X = \{x_i \mid i = 1, \ldots, n\}$ be the set of $n$ $m$-dimensional vectors to be clustered into a set of $K$ clusters $C = \{c_k \mid k = 1, \ldots, K\}$, and let $\mu_k$ be the mean of cluster $c_k$.

The squared-error criterion between $\mu_k$ and the given object in cluster $c_k$ is defined as

\[
J(c_k) = \sum_{x_i \in c_k} \left\| x_i - \mu_k \right\|^2. \tag{1}
\]

The goal of the Par3PKM algorithm is to minimize the sum of the squared errors (SSE) over all the $K$ clusters, where the SSE is given by

\[
\mathrm{SSE} = J(C) = \sum_{k=1}^{K} \sum_{x_i \in c_k} \left\| x_i - \mu_k \right\|^2. \tag{2}
\]
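As a concrete illustration of (1) and (2), the SSE of a given assignment can be computed directly; the following sketch is our own illustrative code (the function and variable names are not from the paper):

```python
import numpy as np

def sse(X, labels, centroids):
    """Sum of squared errors (SSE): for every cluster c_k, add the
    squared Euclidean distance of each member x_i to its mean mu_k."""
    total = 0.0
    for k, mu in enumerate(centroids):
        members = X[labels == k]             # objects x_i assigned to c_k
        total += np.sum((members - mu) ** 2)
    return total

# Toy example: two tight 2-D clusters, each contributing 0.02 to the SSE.
X = np.array([[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.2, 5.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([X[labels == k].mean(axis=0) for k in (0, 1)])
total_sse = sse(X, labels, centroids)
```

Minimizing this quantity over assignments and centroids is the objective that the Map, Combine, and Reduce phases of Section 4.3 pursue iteratively.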

4.2. Distance Measure and Cluster Initialization. It is well known that the K-Means algorithm is sensitive to the distance measure and cluster initialization. To overcome these critical limitations, we optimize the distance metric and initialization strategy for improving the accuracy and efficiency of clustering in the proposed Par3PKM algorithm.

4.2.1. Distance Metric. The traditional K-Means algorithm typically employs the Euclidean metric to compute the distance between objects and cluster centers. In the proposed Par3PKM algorithm, we adopt two rules, using a statistical approach [47] for distance measure selection, which is more appropriate for large-scale trajectory data sets.

The two rules of distance measure selection are illustrated as follows:

(i) If $\kappa \le 8.46$, the square Euclidean distance (see (5)) is chosen as the distance measure of Par3PKM.

(ii) If $\kappa > 8.46$, the Manhattan distance (see (6)) is selected as the distance measure of Par3PKM.

Here, $\kappa$ is the kurtosis, which measures the tail heaviness. It is defined as

\[
\kappa = \frac{(1/n) \sum_{i=1}^{n} \left( x_i - \bar{X} \right)^4}{\sigma^4}, \tag{3}
\]

where $n$ represents the sequence length, and the sample mean $\bar{X}$ and the sample standard deviation $\sigma$ are given by

\[
\bar{X} = \frac{1}{n} \sum_{i=1}^{n} x_i, \qquad
\sigma = \left( \frac{1}{n-1} \sum_{i=1}^{n} \left( x_i - \bar{X} \right)^2 \right)^{1/2}. \tag{4}
\]

The square Euclidean distance $d_e^2$ and the Manhattan distance $d_m$ are, respectively, given by

\[
d_e^2 (x_i, \mu_k) = \sum_{j=1}^{m} \left| x_{ij} - \mu_{kj} \right|^2, \tag{5}
\]

\[
d_m (x_i, \mu_k) = \sum_{j=1}^{m} \left| x_{ij} - \mu_{kj} \right|, \tag{6}
\]

where $x_i = (x_{i1}, x_{i2}, \ldots, x_{im})$ and $\mu_k = (\mu_{k1}, \mu_{k2}, \ldots, \mu_{km})$ are two $m$-dimensional data objects.
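The selection rule can be sketched as below. This is our own illustrative code: `kurtosis`, `pick_distance`, and the `threshold` parameter are hypothetical names, and the kurtosis threshold is left as a parameter so it can be set to whichever value the rule in Section 4.2.1 prescribes.

```python
import numpy as np

def kurtosis(x):
    """Kurtosis kappa as in (3): the mean fourth central moment divided
    by sigma^4, with sigma computed as in (4) (n - 1 denominator)."""
    x = np.asarray(x, dtype=float)
    mean = x.mean()
    sigma = np.sqrt(((x - mean) ** 2).sum() / (x.size - 1))
    return ((x - mean) ** 4).mean() / sigma ** 4

def pick_distance(xi, mu, kappa, threshold=8.46):
    """Distance rule of Section 4.2.1: square Euclidean distance (5)
    when kappa is at most the threshold, Manhattan distance (6) otherwise."""
    xi, mu = np.asarray(xi, dtype=float), np.asarray(mu, dtype=float)
    if kappa <= threshold:
        return np.sum(np.abs(xi - mu) ** 2)   # d_e^2 of (5)
    return np.sum(np.abs(xi - mu))            # d_m of (6)
```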

The Par3PKM algorithm achieves more accurate clustering performance than the K-Means algorithm via this statistics-based distance measure method, which will be evaluated in Section 6.

4.2.2. Initialization Strategy. Different initialization strategies lead to different efficiencies. In our Par3PKM, three initialization strategies are developed to enhance the efficiency of clustering. To overcome the local minima of Par3PKM for a given K, from several different initial partitions we select the partition with the smallest value of the squared error. Furthermore, inspired by [48], we take the mutually farthest data objects in the high density area as the initial centroids for Par3PKM; the method of obtaining the initial centroids from the high density area was introduced in [49]. In addition, both pre- and postprocessing for Par3PKM are taken into consideration: for example, we remove outliers in a preprocessing step, and we eliminate small clusters and/or merge close clusters into a large cluster when postprocessing the results.

From the experimental evaluations described in Section 6, we can observe that the implemented initialization strategies shorten the execution time of Par3PKM while yielding the same clustering results.
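The precise initialization procedure is given in [48, 49]; the following is only a rough, simplified sketch of the idea under our own assumptions (local density estimated by neighbour counts within a user-chosen radius, then farthest-first selection restricted to the high-density area):

```python
import numpy as np

def init_centroids(X, K, radius):
    """Simplified sketch of density-aware initialization: estimate each
    object's local density as the number of neighbours within `radius`,
    start from the densest object, then repeatedly pick the high-density
    object farthest from the centroids chosen so far."""
    X = np.asarray(X, dtype=float)
    # Pairwise distance matrix and neighbour counts (local density).
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    density = (d < radius).sum(axis=1)
    dense = np.where(density >= np.median(density))[0]  # high-density area
    chosen = [dense[np.argmax(density[dense])]]
    while len(chosen) < K:
        # Farthest-first selection within the high-density area.
        gap = d[np.ix_(dense, chosen)].min(axis=1)
        chosen.append(dense[np.argmax(gap)])
    return X[chosen]

# Two well-separated groups: the chosen centroids land in both of them.
seeds = init_centroids([[0, 0], [0.1, 0], [0, 0.1],
                        [10, 10], [10.1, 10], [10, 10.1]], K=2, radius=2.0)
```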

4.3. MapReduce Implementation. To improve the efficiency and scalability of clustering, we implement the Par3PKM algorithm in the MapReduce model of computation. The tasks of Par3PKM are mainly composed of the following three aspects:

(i) Compute the distance between each object and each cluster center.

(ii) Assign each object to its closest centroid.

(iii) Recompute the new cluster center for each cluster.

In this work, we accomplish the aforementioned tasks on a Hadoop platform using MapReduce. First, we select K data objects (vectors) as the initial cluster centers from the input data sets, which are stored in HDFS. Then, we update the centroid of each cluster after each iteration. As depicted in Figure 7, the parallel execution of an iteration is implemented by the Map function, the Combiner function, and the Reduce function in three phases (i.e., the Map phase, the Combine phase, and the Reduce phase), respectively.

4.3.1. Map Phase. In this phase, each Map task receives one line of the sequence file as a key-value pair, which forms the input to the Map function. The Map function first computes the distance between each data object and each cluster center, then assigns each object to its closest centroid according to the shortest distance, and finally outputs the intermediate data to the Combiner function. The Map function is formally depicted in Algorithm 1.

4.3.2. Combine Phase. In this phase, the Combiner function first extracts all the data objects from value1, which is the output of the Map function, and merges the data objects belonging to the same cluster center. Next, it calculates the sum of the values of the data objects assigned to the same cluster and records the number of samples in that cluster, so as to compute the mean value of the data objects. Finally, it outputs the local results of each cluster to the Reduce function. The Combiner function is formally described in Algorithm 2.

4.3.3. Reduce Phase. In this phase, the Reduce function extracts all the data objects from value2, which is the output of the Combiner function, and aggregates the local results of all the clusters. Then, it computes the new cluster center of each cluster. After that, it judges whether the criterion function converges. Finally, it outputs the final results if so, and executes the next iteration otherwise. The Reduce function is formally illustrated in Algorithm 3.
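The three phases above can be sketched in plain Python. This is an illustrative sketch only; the actual implementation runs as Java MapReduce tasks on Hadoop, and the function names here are our own.

```python
import math

def nearest(point, centroids):
    # Index of the closest centroid (the inner loop of the Map phase).
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

def map_phase(points, centroids):
    # Emit (cluster index, point) pairs, as in Algorithm 1.
    return [(nearest(p, centroids), p) for p in points]

def combine_phase(pairs):
    # Per-node partial aggregation: cluster -> (per-dimension sums, sample count),
    # mirroring the Combiner of Algorithm 2.
    partial = {}
    for idx, p in pairs:
        sums, count = partial.get(idx, ([0.0] * len(p), 0))
        partial[idx] = ([s + v for s, v in zip(sums, p)], count + 1)
    return partial

def reduce_phase(partials):
    # Merge the partial sums of all nodes and return the new centroids,
    # mirroring the Reducer of Algorithm 3.
    totals = {}
    for part in partials:
        for idx, (sums, count) in part.items():
            t_sums, t_count = totals.get(idx, ([0.0] * len(sums), 0))
            totals[idx] = ([a + b for a, b in zip(t_sums, sums)], t_count + count)
    return {idx: tuple(s / n for s in sums) for idx, (sums, n) in totals.items()}
```

In the real job, the Map function runs per input split, the Combiner per Map task, and the Reducer once per cluster index; a driver repeats the three phases until the criterion function converges. Shipping only (sum, count) pairs out of the Combiner is what keeps the shuffled data small.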

4.4. Complexity Analysis. MapReduce is a programming model for large-scale data processing, and MapReduce programs are inherently parallel. Thus, the Par3PKM algorithm with its MapReduce implementation distributes large numbers of computational tasks across different nodes. According to this parallel processing paradigm, the time complexity of Par3PKM is O(K × I × n × m)/(p × q), where p is the number of nodes, q is the number of Map tasks on each node, and K, I, n, and m are explained in Section 3. Moreover, the space complexity of Par3PKM is O((K + n) × m)/p. Only the data objects and the centroids are stored in each slave node; thus, the space requirements of Par3PKM are modest.

In comparison with the traditional K-Means algorithm, the Par3PKM algorithm improves the efficiency of clustering at the following rate:

O_imp = (1 − 1/(p × q)) × 100%.  (7)

Based on the above analyses, we can conclude that the Par3PKM algorithm is relatively scalable and efficient in clustering large-scale data sets, with the desired computational complexity.
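For intuition, equation (7) can be evaluated directly. The helper below is an illustrative reading of the formula, not part of the original implementation; for instance, with p = 8 nodes each running q = 2 Map tasks, the improvement rate is 93.75%.

```python
def efficiency_improvement(p, q):
    # Equation (7): O_imp = (1 - 1/(p*q)) * 100, in percent.
    return (1 - 1 / (p * q)) * 100

# With one node and one Map task there is no parallelism, hence 0% improvement.
```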

5. Case Study

In this section, we apply the proposed approach to divide the traffic subareas of Beijing using large-scale taxi trajectories and then analyze the results.

5.1. Data Sets and Framework. In this work, we divide the traffic subareas of Beijing based on real-world trajectory data sets (http://www.datatang.com), which contain a large number of GPS trajectories recorded by 12,000 taxicabs during a period of 30 days in November 2012. The total distance of the data sets is more than 50 million kilometers, and the total size is 50 GB. In particular, the total number of GPS points reaches 969 million.

Additionally, we perform the case study on a Hadoop cluster with MapReduce (as described in Section 6.1) and ArcGIS, using the road network of Beijing, which consists of 106,579 road nodes and 141,380 road segments.

8 Discrete Dynamics in Nature and Society

Input:
  key: the offset
  value: the sample
  centroids: the global variable

Output: ⟨key1, value1⟩
  key1: the index of the closest centroid
  value1: the information of the sample

(1) Construct a global variable centroids holding the information of the cluster centroid points
(2) Construct the sample examples to extract the data objects from value
(3) min_Dis = Double.MAX_VALUE
(4) index = -1
(5) for i = 0 to centroids.length do
(6)   distance = DistanceFunction(examples, centroids[i])
(7)   if distance < min_Dis then
(8)     min_Dis = distance
(9)     index = i
(10)  end if
(11) end for
(12) Take index as key1
(13) Construct value1 as a string consisting of the values of the different dimensions
(14) return ⟨key1, value1⟩ pairs

Algorithm 1: Map(key, value).

Input:
  key1: the index of the cluster
  med_i: the list of the samples assigned to the same cluster

Output: ⟨key2, value2⟩
  key2: the index of the cluster
  value2: the sum of the values of the samples belonging to the same cluster, and the number of such samples

(1) Construct a counter num_s to record the number of samples in the same cluster
(2) Construct an array sum_v to record the sum of the values of the different dimensions of the samples belonging to the same cluster (i.e., the samples in the list med_i)
(3) Construct the sample examples to extract the data objects from med_i.next(), and the dimensions to obtain the dimension of the original data object
(4) num_s = 0
(5) while (med_i.hasNext()) do
(6)   CurrentPoint = med_i.next()
(7)   num_s++
(8)   for i = 0 to dimensions do
(9)     sum_v[i] += CurrentPoint.point[i]
(10)    Calculate the sum of the values of each dimension of examples
(11)  end for
(12)  for i = 0 to dimensions do
(13)    mean[i] = sum_v[i]/num_s
(14)    Compute the mean value of the samples of each cluster
(15)  end for
(16) end while
(17) Take index as key2
(18) Construct value2 as a string containing the sum of the values of each dimension sum_v[i] and the number of samples num_s
(19) return ⟨key2, value2⟩ pairs

Algorithm 2: Combiner(key1, med_i).


Input:
  key2: the index of the cluster
  med_i: the list of the local sums from different clusters

Output: ⟨key3, value3⟩
  key3: the index of the cluster
  value3: the new cluster center

(1) Construct a counter Num to record the total number of samples belonging to the same cluster
(2) Construct an array sum_v to record the sum of the values of the different dimensions of the samples in the same cluster (i.e., the samples in the list med_i)
(3) Construct the sample examples to extract the data objects from med_i.next(), and the dimensions to obtain the dimension of the original data object
(4) Num = 0
(5) while (med_i.hasNext()) do
(6)   CurrentPoint = med_i.next()
(7)   Num += num_s
(8)   for i = 0 to dimensions do
(9)     sum_v[i] += CurrentPoint.point[i]
(10)  end for
(11)  for i = 0 to dimensions do
(12)    mean[i] = sum_v[i]/Num
(13)    Obtain the new cluster center
(14)  end for
(15) end while
(16) Take index as key3
(17) Construct value3 as a string composed of the new cluster center
(18) return ⟨key3, value3⟩ pairs

Algorithm 3: Reduce(key2, med_i).

Figure 8: Clustering results of large-scale taxi trajectories.

5.2. Parallel Clustering. After data preprocessing with the DTSAD method, we extract the relevant attributes (e.g., longitude and latitude) of the GPS trajectory records where passengers pick up or drop off taxis from the aforementioned data sets. Then, on a Hadoop cluster with MapReduce, we cluster the extracted trajectory data sets through the Par3PKM algorithm (as depicted in Section 4) with K = 100, which is the number of desired clusters. Finally, based on the ArcGIS platform with the road network of Beijing, we plot the results in Figure 8.

As illustrated in Figure 8, large-scale taxi trajectories are clustered into different areas, and each area with a different color represents a cluster. Each cluster (e.g., Area A) has obvious traffic-condition characteristics: for example, the flow of people and automobiles is high in these areas, in comparison with the real traffic map and traffic conditions of Beijing.

Figure 9: Process of the boundary identifying method: (a) division of the coordinate system, (b) selection of border points, and (c) connection of border points.

5.3. Boundary Identifying. On the ArcGIS platform, it is difficult to identify the borders of each cluster. However, we must connect the borders of each cluster in order to accurately form traffic subareas; we do so via our boundary identifying method, which is illustrated in Figure 9. The method mainly consists of the following three steps.

Step 1. As shown in Figure 9(a), we build a coordinate system that is equally divided into n parts, taking (0, 0) as the origin of coordinates.

Step 2. We match each cluster center to the origin of coordinates and then map the other points of the same cluster into the coordinate system. Finally, the farthest point of each part is selected (e.g., P in Figure 9(b)).

Step 3. As depicted in Figure 9(c), we connect the selected points of each part and thereby obtain a subarea.
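The three steps can be sketched as follows. This is a minimal sketch assuming 2-D points and Euclidean distance; the sector count `n_sectors` and the function name are our own choices, not values from the paper.

```python
import math

def subarea_boundary(center, points, n_sectors=36):
    # Step 1: treat `center` as the origin and split the plane into
    # n_sectors equal angular parts.
    # Step 2: within each sector, keep only the point farthest from the center.
    # Step 3: connecting the returned points in sector order gives the border.
    farthest = {}
    cx, cy = center
    for x, y in points:
        dx, dy = x - cx, y - cy
        angle = math.atan2(dy, dx) % (2 * math.pi)
        sector = int(angle / (2 * math.pi / n_sectors))
        r = math.hypot(dx, dy)
        if sector not in farthest or r > farthest[sector][0]:
            farthest[sector] = (r, (x, y))
    # Border points ordered by sector angle, ready to be connected.
    return [pt for _, (_, pt) in sorted(farthest.items())]
```

Because the points come back ordered by sector angle, joining consecutive points (and closing the loop) traces the subarea polygon directly.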

Using the presented boundary identifying method with the clustering results of Par3PKM, we plot the division results of the traffic subareas in Figure 10. As described in Figure 10, Area A is a typical traffic subarea, shown in the lower right corner of the graph, which includes Tsinghua University, Peking University, Renmin University of China, Beihang University, and Zhongguancun. Area B is composed of the Beijing Workers' Gymnasium, Blue Zoo Beijing, Children Outdoor Paopao Paradise, and so forth. Area C consists of Beijing North Railway Station, Beijing Exhibition Center, Capital Indoor Stadium, and so forth. Area D contains the Olympic Sports Center Stadium, Beijing Olympic Park, the National Indoor Stadium, the Beijing National Aquatics Center, the Beijing International Convention Center, and so forth.

5.4. Analysis of Results. According to the division results, we can observe that some areas with similar traffic conditions are divided into the same traffic subarea, such as Tsinghua University and Peking University, and the Olympic Sports Center Stadium and the National Indoor Stadium. In contrast, Blue Zoo Beijing and the Beijing Exhibition Center, and Beijing North Railway Station and the Beijing International Convention Center, are classified into different traffic subareas. This is because the different regions of a traffic subarea have great similarities and correlations in traffic conditions, business patterns, and other aspects.

Based on the above analysis, we conclude that the proposed Par3PKM algorithm can efficiently cluster big trajectory data on a Hadoop cluster using MapReduce. Moreover, our boundary identifying method can accurately connect the borders of the clustering results of each cluster. In particular, the division results are consistent with the real traffic conditions of the corresponding areas in Beijing. Overall, the results demonstrate that Par3PKM combined with DTSAD is a promising alternative for traffic subarea division with large-scale taxi trajectories and thus can reduce the complexity of traffic planning, management, and analysis. More importantly, it can provide helpful decision-making support for building ITSs.

6. Evaluation and Discussion

In this section, we evaluate the accuracy, efficiency, speedup, scale-up, and reliability of the Par3PKM algorithm via extensive experiments on real and synthetic data and then discuss the performance results.

6.1. Evaluation Setup. The experimental platform is based on a Hadoop cluster composed of one master machine and eight slave machines, each with an Intel Xeon E7-4820 2.00 GHz CPU (4-core) and 8.00 GB RAM. All the experiments are performed on Ubuntu 12.04 OS with Hadoop 1.0.4 and JDK 1.6.0.


Figure 10: Division results of the traffic subareas.

Table 1: Data sets of the experimental evaluations.

Name                  Number of instances   Number of attributes
Iris                  150                   4
Haberman's Survival   306                   3
Ecoli                 336                   8
Hayes-Roth            160                   5
Lenses                24                    4
Wine                  178                   13

In addition to a real taxi trajectory data set (as described in Section 5.1), we use six synthetic data sets (as shown in Table 1) selected from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets.html) to evaluate the performance of the Par3PKM algorithm in comparison with K-Means, Par2PK-Means, and ParCLARA, the last being a MapReduce implementation of the best-known K-medoid algorithm, CLARA (Clustering Large Applications) [50]. Meanwhile, each data set is processed into versions of 80 MB, 160 MB, 320 MB, 640 MB, 1280 MB, and 2560 MB so as to further verify the efficiency of the proposed algorithm. Also, we process the seven data sets into versions of 160 MB, 320 MB, 480 MB, 640 MB, 800 MB, 960 MB, 1120 MB, and 1280 MB to validate the scale-up of our algorithm.

6.2. Evaluation on Efficiency. We perform efficiency experiments in which we execute Par3PKM, Par2PK-Means, and ParCLARA in the parallel environment with eight nodes, and K-Means in the single-machine environment, using seven data sets of different sizes (varying from 80 MB to 2560 MB), in order to determine whether the Par3PKM algorithm can process larger data sets with higher efficiency. The experimental results are shown in Table 2 and Figure 11, respectively.

As depicted in Figure 11, the K-Means algorithm cannot process data sets over 1280 MB in the single-machine environment on account of memory overflow, and thus the graph does not present the corresponding execution time of K-Means in this experiment. However, the Par3PKM algorithm can effectively handle data sets of more than 1280 MB, and even larger data, in the parallel environment (i.e., in a "Big Data" environment). In particular, the Par3PKM algorithm has higher efficiency than the ParCLARA algorithm and the Par2PK-Means algorithm, owing to the improvement and parallelization of the K-Means algorithm, such as the addition of the Combiner function in the Combine phase. To reduce the computational complexity of the MapReduce job and save the limited bandwidth available on a Hadoop cluster, the Combiner function of the Par3PKM algorithm is employed to cut down the amount of data shuffled between the Map tasks and the Reduce tasks; it is specified to run on the Map output, and its output forms the input to the Reduce function.

At the same time, we find that the execution time of the K-Means algorithm is shorter than that of the Par3PKM algorithm when clustering small-scale data sets. The reason is that the communication and interaction of each node consume a certain amount of time in the parallel environment (e.g., the start of Job and Task processes, and the communication between the NameNode and the DataNodes), thereby making the execution time much longer than the actual computation time of the Par3PKM algorithm. More importantly, we can also observe that the efficiency advantage of the Par3PKM algorithm multiplies, and its superiority becomes more marked, as the size of the data sets gradually increases.


Table 2: Execution time comparison on the seven data sets.

Data sets             Size (MB)   Execution time (s)
                                  K-Means   ParCLARA   Par2PK-Means   Par3PKM
Taxi Trajectory       80          75        662        439            312
                      160         103       756        486            371
                      320         195       893        579            406
                      640         1125      1173       838            614
                      1280        2373      1679       1348           1026
                      2560        —         2721       2312           1879
Iris                  80          32        612        301            213
                      160         54        693        380            279
                      320         116       812        440            306
                      640         685       1045       630            403
                      1280        1800      1248       1005           768
                      2560        —         2463       2013           1420
Haberman's Survival   80          56        576        311            296
                      160         60        675        400            324
                      320         130       823        470            378
                      640         720       987        670            426
                      1280        2010      1321       1200           873
                      2560        —         2449       2200           1719
Ecoli                 80          38        628        330            283
                      160         66        712        400            324
                      320         130       835        460            375
                      640         756       1104       700            482
                      1280        1912      1636       1234           763
                      2560        —         2479       2312           1416
Hayes-Roth            80          41        568        310            278
                      160         57        643        395            347
                      320         125       726        460            387
                      640         715       973        660            438
                      1280        1980      1479       1211           736
                      2560        —         2423       2120           1567
Lenses                80          32        624        309            297
                      160         59        701        389            327
                      320         130       924        452            376
                      640         700       1072       545            432
                      1280        1895      1378       1085           814
                      2560        —         2379       2089           1473
Wine                  80          50        635        350            317
                      160         78        705        420            356
                      320         130       835        470            402
                      640         730       1006       610            426
                      1280        2100      1346       1245           843
                      2560        —         2463       2240           1645

6.3. Evaluation on Accuracy. To evaluate the accuracy, we cluster the different data sets via Par3PKM, Par2PK-Means, and ParCLARA on a Hadoop cluster with eight nodes, and via K-Means in the single-machine environment, respectively. We then plot the results in Figure 12(a).

The quality of the algorithms is evaluated via the following error rate (ER) [51] equation:

ER = (O_m / O_t) × 100%,  (8)

Figure 11: Efficiency comparison on different data sets: (a) Taxi Trajectory, (b) Iris, (c) Haberman's Survival, (d) Ecoli, (e) Hayes-Roth, (f) Lenses, and (g) Wine.

Figure 12: Accuracy and reliability: (a) accuracy comparison of Par3PKM, Par2PK-Means, and ParCLARA on different data sets and (b) reliability of Par3PKM.

where O_m is the number of misclassified objects and O_t is the total number of objects. The lower the ER, the better the clustering.
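Equation (8) amounts to the following one-line helper (illustrative only; the function name is ours):

```python
def error_rate(misclassified, total):
    # Equation (8): ER = (O_m / O_t) * 100, in percent; lower is better.
    return misclassified / total * 100
```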

As illustrated in Figure 12(a), in comparison with the other algorithms, the Par3PKM algorithm produces more accurate clustering results in most cases. The results indicate that the Par3PKM algorithm is valid and feasible.

6.4. Evaluation on Speedup. In order to evaluate the speedup of the Par3PKM algorithm, we keep each of the seven data sets constant (1280 MB) and increase the number of nodes (ranging from 1 node to 8 nodes) on the Hadoop cluster, and then plot the results in Figure 13(a). Moreover, we utilize the Iris data set (1280 MB) to further verify the speedup of Par3PKM in comparison with Par2PK-Means and ParCLARA; the results are illustrated in Figure 13(b).

The speedup metric [52, 53] is defined as

Speedup = T_s / T_p,  (9)

where T_s represents the execution time of an algorithm on one node (i.e., the sequential execution time) for clustering objects on the given data sets, and T_p denotes the execution time of the same algorithm for solving the same problem on the same data sets on a Hadoop cluster with p nodes (i.e., the parallel execution time).

Figure 13: Speedup: (a) speedup comparison of Par3PKM on different data sets and (b) speedup comparison of Par3PKM, Par2PK-Means, and ParCLARA.

As depicted in Figure 13(a), the speedup of the Par3PKM algorithm increases relatively linearly with an increasing number of nodes. It is known that linear speedup is difficult to achieve because of the communication cost and the skew of the slaves. Furthermore, Figure 13(b) shows that Par3PKM has better speedup than Par2PK-Means and ParCLARA. The results demonstrate that the parallel Par3PKM algorithm has very good speedup performance, which is almost the same for data sets of very different sizes.

6.5. Evaluation on Scale-Up. To evaluate how well the Par3PKM algorithm processes larger data sets when more nodes are available, we perform scale-up experiments in which we increase the size of the data sets (varying from 160 MB to 1280 MB) in direct proportion to the number of nodes (ranging from 1 node to 8 nodes), and then plot the results in Figure 14(a). Furthermore, with the Iris data sets (varying from 160 MB to 1280 MB), the scale-up comparison of Par3PKM, Par2PK-Means, and ParCLARA is depicted in Figure 14(b).

The scale-up metric [52, 53] is given by

Scale-up = T_s / T_p,  (10)

where T_s is the execution time of an algorithm for processing the given data sets on one node, and T_p is the execution time of the algorithm for handling p-times larger data sets on p-times more nodes.
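Under the definitions of equations (9) and (10), both metrics are simple ratios of measured execution times. The sketch below is illustrative; the variable names are ours.

```python
def speedup(t_serial, t_parallel):
    # Equation (9): T_s / T_p; ideal (linear) speedup on p nodes equals p.
    return t_serial / t_parallel

def scaleup(t_one_node, t_p_nodes):
    # Equation (10): time on one node over the time for p-times larger data
    # on p-times more nodes; values near 1 indicate good scale-up.
    return t_one_node / t_p_nodes
```

For example, an 8-node run that is eight times faster than the 1-node run has speedup 8 (linear), and a scale-up value of exactly 1 means the extra nodes fully absorb the extra data.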

As illustrated in Figure 14(a), the scale-up values of Par3PKM are in the vicinity of 1, or even less, with the proportional growth of both the number of nodes and the size of the data sets. Moreover, Figure 14(b) shows that Par3PKM has better scale-up than Par2PK-Means and ParCLARA. The results indicate that the Par3PKM algorithm has excellent scale-up and adaptability on large-scale data sets on a Hadoop cluster with MapReduce.

6.6. Evaluation on Reliability. To evaluate the reliability of the Par3PKM algorithm, we shut down several nodes (ranging from 1 node to 7 nodes) to determine whether Par3PKM can still execute normally and achieve the same clustering results on the 1280 MB Iris data set, and we plot the results in Figure 12(b).

As illustrated in Figure 12(b), although the execution time of the Par3PKM algorithm increases gradually with the growth of the number of faulty nodes, the Par3PKM algorithm still executes normally and produces the same results. This shows that the Par3PKM algorithm has good reliability in a "Big Data" environment, owing to the high fault tolerance of the MapReduce framework on a Hadoop platform: when a node cannot execute tasks on a Hadoop cluster, the JobTracker automatically assigns the tasks of the faulty node(s) to other spare nodes. Conversely, the serial K-Means algorithm on a single machine cannot execute the tasks when the machine is faulty, and the entire computational task will fail.

In summary, extensive experiments were conducted on real and synthetic data, and the performance results demonstrate that the proposed Par3PKM algorithm is much more efficient and accurate, with better speedup, scale-up, and reliability.


Figure 14: Scale-up: (a) scale-up comparison of Par3PKM on different data sets and (b) scale-up comparison of Par3PKM, Par2PK-Means, and ParCLARA.

7. Conclusions

In this paper, we have proposed an efficient MapReduce-based parallel clustering algorithm, named Par3PKM, to solve the traffic subarea division problem with large-scale taxi trajectories. In Par3PKM, the distance metric and initialization strategy of K-Means are optimized in order to enhance the accuracy of clustering. Then, to improve the efficiency and scalability of Par3PKM, the optimized K-Means algorithm is implemented in a MapReduce framework on Hadoop. The optimization and parallelization of Par3PKM save memory consumption and reduce the computational cost of big calculations, thereby significantly improving the accuracy, efficiency, scalability, and reliability of traffic subarea division. Our performance evaluation indicates that the proposed algorithm can efficiently cluster a large number of GPS trajectories of taxicabs and, in particular, achieves more accurate results than K-Means, Par2PK-Means, and ParCLARA, with favorably superior performance. Furthermore, based on Par3PKM, we have presented a distributed traffic subarea division method, named DTSAD, which is performed on a Hadoop distributed computing platform with the MapReduce parallel processing paradigm. In DTSAD, the boundary identifying method can effectively connect the borders of the clustering results. Most importantly, we have divided the traffic subareas of Beijing using big real-world taxi trajectory data sets through the presented method, and our case study demonstrates that our approach can accurately and efficiently divide traffic subareas.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Authors' Contribution

Dawen Xia and Binfeng Wang contributed equally to this work.

Acknowledgments

The authors would like to thank the academic editor and the anonymous reviewers for their valuable comments and suggestions. This work was partially supported by the National Natural Science Foundation of China (Grant no. 61402380), the Scientific Project of the State Ethnic Affairs Commission of the People's Republic of China (Grant no. 14GZZ012), the Science and Technology Foundation of Guizhou (Grant no. LH20147386), and the Fundamental Research Funds for the Central Universities (Grants nos. XDJK2015B030 and XDJK2015D029).

References

[1] Y. Qi and S. Ishak, "Stochastic approach for short-term freeway traffic prediction during peak periods," IEEE Transactions on Intelligent Transportation Systems, vol. 14, no. 2, pp. 660–672, 2013.

[2] A. de Palma and R. Lindsey, "Traffic congestion pricing methodologies and technologies," Transportation Research Part C: Emerging Technologies, vol. 19, no. 6, pp. 1377–1399, 2011.

[3] V. Marx, "The big challenges of big data," Nature, vol. 498, no. 7453, pp. 255–260, 2013.

[4] D. Agrawal, P. Bernstein, E. Bertino et al., "Challenges and opportunities with big data: a community white paper developed by leading researchers across the United States," White Paper, 2012.


[5] "Special online collection: dealing with data," Science, vol. 331, no. 6018, pp. 639–806, 2011.

[6] "Big data: science in the petabyte era," Nature, vol. 455, no. 7209, pp. 1–136, 2008.

[7] R. R. Weiss and L. Zgorski, Obama Administration Unveils 'Big Data' Initiative: Announces $200 Million in New R&D Investments, Office of Science and Technology Policy, Executive Office of the President, 2012.

[8] J. Manyika, M. Chui, B. Brown et al., "Big data: the next frontier for innovation, competition, and productivity," Tech. Rep., McKinsey Global Institute, 2011.

[9] Y. Genovese and S. Prentice, "Pattern-based strategy: getting value from big data," Gartner Special Report, 2011.

[10] N. J. Yuan, Y. Zheng, L. Zhang, and X. Xie, "T-finder: a recommender system for finding passengers and vacant taxis," IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 10, pp. 2390–2403, 2013.

[11] J. Yuan, Y. Zheng, C. Zhang et al., "T-drive: driving directions based on taxi trajectories," in Proceedings of the 18th International Conference on Advances in Geographic Information Systems (ACM SIGSPATIAL GIS '10), pp. 99–108, ACM, San Jose, Calif, USA, November 2010.

[12] J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, Waltham, Mass, USA, 3rd edition, 2011.

[13] D. Pelleg and A. Moore, "X-means: extending K-means with efficient estimation of the number of clusters," in Proceedings of the 17th International Conference on Machine Learning (ICML '00), pp. 727–734, Stanford, Calif, USA, June 2000.

[14] S. Kantabutra and A. L. Couch, "Parallel K-means clustering algorithm on NOWs," NECTEC Technical Journal, vol. 1, no. 6, pp. 243–247, 2000.

[15] Y. Zhang, Z. Xiong, J. Mao, and L. Ou, "The study of parallel K-means algorithm," in Proceedings of the 6th World Congress on Intelligent Control and Automation (WCICA '06), pp. 5868–5871, IEEE, Dalian, China, June 2006.

[16] P. Kraj, A. Sharma, N. Garge, R. Podolsky, and R. A. McIndoe, "ParaKMeans: implementation of a parallelized K-means algorithm suitable for general laboratory use," BMC Bioinformatics, vol. 9, article 200, 13 pages, 2008.

[17] M. K. Pakhira, "Clustering large databases in distributed environment," in Proceedings of the IEEE International Advance Computing Conference (IACC '09), pp. 351–358, Patiala, India, March 2009.

[18] K. J. Kohlhoff, V. S. Pande, and R. B. Altman, "K-means for parallel architectures using all-prefix-sum sorting and updating steps," IEEE Transactions on Parallel and Distributed Systems, vol. 24, no. 8, pp. 1602–1612, 2013.

[19] C.-T. Chu, S. K. Kim, Y.-A. Lin et al., "Map-reduce for machine learning on multicore," in Proceedings of the 20th Annual Conference on Neural Information Processing Systems (NIPS '06), pp. 281–288, Vancouver, Canada, 2006.

[20] W. Zhao, H. Ma, and Q. He, "Parallel K-means clustering based on MapReduce," in Proceedings of the 1st International Conference on Cloud Computing (CloudCom '09), pp. 674–679, Beijing, China, December 2009.

[21] P. Zhou, J. Lei, and W. Ye, "Large-scale data sets clustering based on MapReduce and Hadoop," Journal of Computational Information Systems, vol. 7, no. 16, pp. 5956–5963, 2011.

[22] C. D. Nguyen, D. T. Nguyen, and V.-H. Pham, "Parallel two-phase K-means," in Proceedings of the 13th International Conference on Computational Science and Its Applications (ICCSA '13), pp. 24–27, Ho Chi Minh City, Vietnam, June 2013.

[23] R. J. Walinchus, "Real-time network decomposition and subnetwork interfacing," Tech. Rep. HS-011 999, Highway Research Record, 1971.

[24] S. C. Wong, W. T. Wong, C. M. Leung, and C. O. Tong, "Group-based optimization of a time-dependent TRANSYT traffic model for area traffic control," Transportation Research Part B: Methodological, vol. 36, no. 4, pp. 291–312, 2002.

[25] D. I. Robertson and R. D. Bretherton, "Optimizing networks of traffic signals in real time: the SCOOT method," IEEE Transactions on Vehicular Technology, vol. 40, no. 1, pp. 11–15, 1991.

[26] Y.-Y. Ma and X.-G. Yang, "Traffic sub-area division expert system for urban traffic control," in Proceedings of the International Conference on Intelligent Computation Technology and Automation (ICICTA '08), pp. 589–593, Hunan, China, October 2008.

[27] K. Lu, J.-M. Xu, S.-J. Zheng, and S.-M. Wang, "Research on fast dynamic division method of coordinated control subarea," Acta Automatica Sinica, vol. 38, no. 2, pp. 279–287, 2012.

[28] K. Lu, J.-M. Xu, and S.-J. Zheng, "Correlation degree analysis of neighboring intersections and its application," Journal of South China University of Technology, vol. 37, no. 11, pp. 37–42, 2009.

[29] H. Guo, J. Cheng, Q. Peng, C. Zhu, and Y. Mu, "Dynamic division of traffic control sub-area methods based on the similarity of adjacent intersections," in Proceedings of the IEEE 17th International Conference on Intelligent Transportation Systems (ITSC '14), pp. 2208–2213, Qingdao, China, October 2014.

[30] C. Li, Y. Xie, H. Zhang, and X.-L. Yan, "Dynamic division about traffic control sub-area based on back propagation neural network," in Proceedings of the 2nd International Conference on Intelligent Human-Machine Systems and Cybernetics (IHMSC '10), pp. 22–25, Nanjing, China, August 2010.

[31] Z. Zhou, S. Lin, and Y. Xi, "A fast network partition method for large-scale urban traffic networks," Journal of Control Theory and Applications, vol. 11, no. 3, pp. 359–366, 2013.

[32] J. MacQueen, "Some methods for classification and analysis of multivariate observations," in Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, pp. 281–297, University of California Press, Berkeley, Calif, USA, 1967.

[33] A. K. Jain, "Data clustering: 50 years beyond K-means," Pattern Recognition Letters, vol. 31, no. 8, pp. 651–666, 2010.

[34] X. Wu, V. Kumar, J. R. Quinlan et al., "Top 10 algorithms in data mining," Knowledge and Information Systems, vol. 14, no. 1, pp. 1–37, 2008.

[35] W. Zou, Y. Zhu, H. Chen, and X. Sui, "A clustering approach using cooperative artificial bee colony algorithm," Discrete Dynamics in Nature and Society, vol. 2010, Article ID 459796, 16 pages, 2010.

[36] D. T. Pham, S. S. Dimov, and C. D. Nguyen, "An incremental K-means algorithm," Proceedings of the Institution of Mechanical Engineers, Part C: Journal of Mechanical Engineering Science, vol. 218, no. 7, pp. 783–794, 2004.

[37] D. T. Pham, S. S. Dimov, and C. D. Nguyen, "A two-phase K-means algorithm for large datasets," Journal of Mechanical Engineering Science, vol. 218, no. 10, pp. 1269–1273, 2004.

[38] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, "The Hadoop distributed file system," in Proceedings of the 26th Symposium on Mass Storage Systems and Technologies (MSST '10), pp. 1–10, IEEE, Incline Village, Nev, USA, May 2010.

[39] A Ene S Im and B Moseley ldquoFast clustering using MapRe-ducerdquo in Proceedings of the 17th ACM SIGKDD InternationalConference on Knowledge Discovery and DataMining (KDD rsquo11)pp 681ndash689 San Diego Calif USA August 2011

[40] T White Hadoop The Definitive Guide OrsquoReilly MediaSebastopol Calif USA 3rd edition 2012

[41] D Xia Z Rong Y Zhou Y Li Y Shen and Z Zhang ldquoA novelparallel algorithm for frequent itemsetsmining inmassive smallfiles datasetsrdquo ICIC Express Letters Part B Applications vol 5no 2 pp 459ndash466 2014

[42] D Xia Z Rong Y Zhou B Wang Y Li and Z ZhangldquoDiscovery and analysis of usage data based on Hadoop forpersonalized information accessrdquo in Proceedings of the IEEE16th International Conference on Computational Science andEngineeringmdashBig Data Science and Engineering (CSE-BDSE rsquo13)pp 917ndash924 IEEE Sydney Australia December 2013

[43] D Xia B Wang Z Rong Y Li and Z Zhang ldquoEffectivemethods and strategies for massive small files processing basedon Hadooprdquo ICIC Express Letters vol 8 no 7 pp 1935ndash19412014

[44] S Ghemawat H Gobioff and S-T Leung ldquoThe google file sys-temrdquo in Proceedings of the 19th ACM Symposium on OperatingSystems Principles (SOSP rsquo03) pp 29ndash43 Bolton Landing NYUSA October 2003

[45] J Dean and SGhemawat ldquoMapReduce simplified data process-ing on large clustersrdquo Communications of the ACM vol 51 no1 pp 107ndash113 2008

[46] P Zikopoulos C Eaton D deRoos T Deutsch and G LapisUnderstanding Big Data Analytics for Enterprise Class Hadoopand Streaming Data McGraw-Hill New York NY USA 2011

[47] W K D Pun and A B M S Ali ldquoUnique distance measureapproach for K-means (UDMA-Km) clustering algorithmrdquo inProceedings of the IEEE Region 10 Conference (TENCON rsquo07)pp 1ndash4 IEEE Taipei Taiwan November 2007

[48] A M Fahim A M Salem F A Torkey and M A RamadanldquoAn efficient enhanced K-means clustering algorithmrdquo Journalof Zhejiang University SCIENCE A vol 7 no 10 pp 1626ndash16332006

[49] M Zhu Data Mining University of Science and Technology ofChina Press 2002

[50] L Kaufman and P J Rousseeuw Finding Groups in Data AnIntroduction to Cluster Analysis John Wiley amp Sons 1990

[51] A Krizhevsky I Sutskever andG EHinton ldquoImagenet classifi-cation with deep convolutional neural networksrdquo in Proceedingsof the 26th Annual Conference on Neural Information ProcessingSystems (NIPS rsquo12) pp 1097ndash1105 Lake Tahoe Nev USADecember 2012

[52] S Englert J Gray T Kocher and P Shah ldquoA benchmark ofNonStop SQL release 2 demonstrating near-linear speedup andscaleup on large databasesrdquo ACM SIGMETRICS PerformanceEvaluation Review vol 18 no 1 pp 245ndash246 1990

[53] X Xu J Jager and H-P Kriegel ldquoA fast parallel clustering algo-rithm for large spatial databasesrdquo Data Mining and KnowledgeDiscovery vol 3 no 3 pp 263ndash290 1999


where x_i = (x_i1, x_i2, ..., x_im) and μ_k = (μ_k1, μ_k2, ..., μ_km) are two m-dimensional data objects.

The Par3PKM algorithm achieves more accurate clustering than the K-Means algorithm via this statistics-based distance measure, which will be evaluated in Section 6.
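For reference, the plain Euclidean distance between two m-dimensional objects can be sketched as follows; this is a minimal illustration only, and the statistics-based modification used by Par3PKM is not reproduced here:

```python
import math

def euclidean(x, mu):
    # Distance between an m-dimensional data object x_i and a centroid mu_k.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, mu)))

print(euclidean((0.0, 0.0, 0.0), (3.0, 4.0, 0.0)))  # 5.0
```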

4.2.2. Initialization Strategy. Different initialization strategies lead to different efficiencies. In our Par3PKM, three initialization strategies are developed to enhance the efficiency of clustering. To overcome the local minima of Par3PKM for a given K, we run several different initial partitions and select the partition with the smallest value of the squared error. Furthermore, inspired by [48], we take the mutually farthest data objects in the high-density area as the initial centroids of Par3PKM; the method of obtaining the initial centroids from the high-density area was introduced in [49]. In addition, both pre- and postprocessing of Par3PKM are taken into consideration; for example, we remove outliers in a preprocessing step, and we eliminate small clusters and/or merge close clusters into a large cluster when postprocessing the results.

From the experimental evaluations described in Section 6, we can observe that the implemented initialization strategies shorten the execution time of Par3PKM while producing the same clustering results.
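One way to realize such an initialization is sketched below. This is our own hedged illustration, not the paper's exact procedure: density is estimated by counting neighbours within a radius, candidates are restricted to the high-density half, and K mutually farthest candidates are chosen greedily. The radius value and the median cut-off are illustrative assumptions.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def initial_centroids(points, k, radius):
    # Density of a point = number of points within `radius` of it (including itself).
    density = [sum(1 for q in points if euclidean(p, q) <= radius) for p in points]
    threshold = sorted(density)[len(density) // 2]  # keep the high-density half
    candidates = [p for p, d in zip(points, density) if d >= threshold]
    if len(candidates) < k:          # fall back if the cut-off is too aggressive
        candidates = list(points)
    # Greedily add the candidate farthest from the already chosen centroids.
    centroids = [candidates[0]]
    while len(centroids) < k:
        centroids.append(max(candidates,
                             key=lambda p: min(euclidean(p, c) for c in centroids)))
    return centroids

pts = [(0, 0), (0.1, 0), (0, 0.1), (5, 5), (5.1, 5), (5, 5.1)]
print(initial_centroids(pts, 2, radius=0.5))  # one centroid per dense blob
```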

4.3. MapReduce Implementation. To improve the efficiency and scalability of clustering, we implement the Par3PKM algorithm in the MapReduce model of computation. The tasks of Par3PKM are mainly composed of the following three aspects.

(i) Compute the distance between each object and each cluster center.

(ii) Assign each object to its closest centroid.

(iii) Recompute the new cluster center of each cluster.

In this work, we accomplish the aforementioned tasks on a Hadoop platform using MapReduce. First, we select K data objects (vectors) from the input data sets, which are stored in HDFS, as the initial cluster centers. Then, we update the centroid of each cluster after each iteration. As depicted in Figure 7, the parallel execution of an iteration is implemented by the Map function, the Combiner function, and the Reduce function in three phases (i.e., the Map phase, the Combine phase, and the Reduce phase), respectively.

4.3.1. Map Phase. In this phase, a Map task receives each line of the sequence file as a distinct key-value pair, which forms the input to the Map function. The Map function first computes the distance between each data object and each cluster center, then assigns each object to its closest centroid according to the shortest distance, and finally outputs the intermediate data to the Combiner function. The Map function is formally depicted in Algorithm 1.

4.3.2. Combine Phase. In this phase, the Combiner function first extracts all the data objects from value1, which is the output of the Map function, and merges the data objects belonging to the same cluster center. Next, it calculates the sum of the values of the data objects assigned to the same cluster and records the number of samples in that cluster, so that the mean value of the data objects can later be computed. Finally, it outputs the local results of each cluster to the Reduce function. The Combiner function is formally described in Algorithm 2.

4.3.3. Reduce Phase. In this phase, the Reduce function extracts all the data objects from value2, which is the output of the Combiner function, and aggregates the local results of all the clusters. Then, it computes the new cluster center of each cluster. After that, it judges whether the criterion function has converged. Finally, it outputs the final results if this is true and executes the next iteration otherwise. The Reduce function is formally illustrated in Algorithm 3.
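The three phases described above can be simulated on a single machine with plain functions. The sketch below is ours (hypothetical helper names, not the paper's Java implementation): the Map step assigns points to their nearest centroid, the Combine step emits per-cluster partial sums and sample counts, and the Reduce step merges the partials from all combiners into new centroids.

```python
import math
from collections import defaultdict

def map_phase(points, centroids):
    """Map: assign each data object to the index of its closest centroid."""
    pairs = []
    for p in points:
        idx = min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
        pairs.append((idx, p))
    return pairs

def combine_phase(pairs):
    """Combine: per-cluster coordinate sums and sample counts, local to one map task."""
    sums, counts = defaultdict(lambda: None), defaultdict(int)
    for idx, p in pairs:
        if sums[idx] is None:
            sums[idx] = list(p)
        else:
            sums[idx] = [s + x for s, x in zip(sums[idx], p)]
        counts[idx] += 1
    return {idx: (sums[idx], counts[idx]) for idx in sums}

def reduce_phase(partials):
    """Reduce: merge partial sums from all combiners and emit new centroids."""
    total_sum, total_n = defaultdict(lambda: None), defaultdict(int)
    for partial in partials:
        for idx, (s, n) in partial.items():
            if total_sum[idx] is None:
                total_sum[idx] = list(s)
            else:
                total_sum[idx] = [a + b for a, b in zip(total_sum[idx], s)]
            total_n[idx] += n
    return {idx: tuple(v / total_n[idx] for v in total_sum[idx]) for idx in total_sum}

# One iteration over two simulated map tasks ("splits").
splits = [[(0.0, 0.0), (1.0, 0.0)], [(9.0, 9.0), (11.0, 9.0)]]
centroids = [(0.0, 0.0), (10.0, 10.0)]
partials = [combine_phase(map_phase(s, centroids)) for s in splits]
print(reduce_phase(partials))  # {0: (0.5, 0.0), 1: (10.0, 9.0)}
```

Note that only the small per-cluster (sum, count) pairs cross the network between combiners and the reducer, which is what makes the Combine phase worthwhile.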

4.4. Complexity Analysis. MapReduce is a programming model for large-scale data processing, and MapReduce programs are inherently parallel. Thus, the Par3PKM algorithm with its MapReduce implementation distributes large numbers of computational tasks across different nodes. According to this parallel processing paradigm, the time complexity of Par3PKM is O(K × I × n × m / (p × q)), where p is the number of nodes, q is the number of Map tasks in one node, and K, I, n, and m are explained in Section 3. Moreover, the space complexity of Par3PKM is O((K + n) × m / p); only the data objects and the centroids are stored in each slave node, so the space requirements of Par3PKM are modest.

In comparison with the traditional K-Means algorithm, the Par3PKM algorithm improves the efficiency of clustering by the following rate:

O_imp = (1 − 1/(p × q)) × 100. (7)

Based on the above analyses, we can conclude that the Par3PKM algorithm is relatively scalable and efficient in clustering large-scale data sets with the desired computational complexity.
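As a worked instance of equation (7), the improvement rate for a hypothetical cluster of 8 nodes running 4 Map tasks each:

```python
def improvement_rate(p, q):
    # O_imp from equation (7): p nodes, q Map tasks per node.
    return (1 - 1 / (p * q)) * 100

print(improvement_rate(8, 4))  # 96.875
```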

5. Case Study

In this section, we apply the proposed approach to divide the traffic subareas of Beijing using large-scale taxi trajectories and then analyze the results.

5.1. Data Sets and Framework. In this work, we divide the traffic subareas of Beijing based on real-world trajectory data sets (http://www.datatang.com), which contain a large number of GPS trajectories recorded by 12,000 taxicabs over a period of 30 days in November 2012. The total distance covered by the data sets is more than 50 million kilometers, and the total size is 50 GB. In particular, the total number of GPS points reaches 969 million.

Additionally, we perform the case study on a Hadoop cluster with MapReduce (as described in Section 6.1) and on ArcGIS using the road network of Beijing, which consists of 106,579 road nodes and 141,380 road segments.


Input:
  key: the offset of a line in the input file
  value: the sample (one data object)
  centroids: the global variable holding the current cluster centers
Output: ⟨key1, value1⟩ pairs
  key1: the index of the closest centroid
  value1: the information of the sample

(1) Construct a global variable centroids with the information of the cluster centers
(2) Construct the sample examples to extract the data objects from value
(3) min_Dis = Double.MAX_VALUE
(4) index = -1
(5) for i = 0 to centroids.length do
(6)   distance = DistanceFunction(examples, centroids[i])
(7)   if distance < min_Dis then
(8)     min_Dis = distance
(9)     index = i
(10)  end if
(11) end for
(12) key1 = index
(13) Construct value1 as a string consisting of the values of the different dimensions
(14) return ⟨key1, value1⟩ pairs

Algorithm 1: Map(key, value).

Input:
  key1: the index of the cluster
  medi: the list of the samples assigned to the same cluster
Output: ⟨key2, value2⟩ pairs
  key2: the index of the cluster
  value2: the sum of the values of the samples belonging to the same cluster and the number of samples

(1) Construct a counter num_s to record the number of samples in the same cluster
(2) Construct an array sum_v to record the sum of the values of the different dimensions of the samples belonging to the same cluster (i.e., the samples in the list medi)
(3) Construct the sample examples to extract the data objects from medi.next() and dimensions to obtain the dimension of the original data objects
(4) num_s = 0
(5) while (medi.hasNext()) do
(6)   CurrentPoint = medi.next()
(7)   num_s++
(8)   for i = 0 to dimensions do
(9)     sum_v[i] += CurrentPoint.point[i]  // accumulate the sum of each dimension
(10)  end for
(11) end while
(12) for i = 0 to dimensions do
(13)   mean[i] = sum_v[i]/num_s  // the mean value of the samples of the cluster
(14) end for
(15) key2 = key1
(16) Construct value2 as a string containing the sum of the values of each dimension sum_v[i] and the number of samples num_s
(17) return ⟨key2, value2⟩ pairs

Algorithm 2: Combiner(key1, medi).


Input:
  key2: the index of the cluster
  medi: the list of the local partial sums from different combiners
Output: ⟨key3, value3⟩ pairs
  key3: the index of the cluster
  value3: the new cluster center

(1) Construct a counter Num to record the total number of samples belonging to the same cluster
(2) Construct an array sum_v to record the sum of the values of the different dimensions of the samples in the same cluster (i.e., the samples in the list medi)
(3) Construct the sample examples to extract the data objects from medi.next() and dimensions to obtain the dimension of the original data objects
(4) Num = 0
(5) while (medi.hasNext()) do
(6)   CurrentPoint = medi.next()
(7)   Num += CurrentPoint.num_s
(8)   for i = 0 to dimensions do
(9)     sum_v[i] += CurrentPoint.point[i]
(10)  end for
(11) end while
(12) for i = 0 to dimensions do
(13)   mean[i] = sum_v[i]/Num  // the new cluster center
(14) end for
(15) key3 = key2
(16) Construct value3 as a string composed of the new cluster center
(17) return ⟨key3, value3⟩ pairs

Algorithm 3: Reduce(key2, medi).

Figure 8: Clustering results of large-scale taxi trajectories.

5.2. Parallel Clustering. After data preprocessing with the DTSAD method, we extract the relevant attributes (e.g., longitude and latitude) of the GPS trajectory records where passengers pick up or drop off taxis from the aforementioned data sets. Then, on a Hadoop cluster with MapReduce, we cluster the extracted trajectory data sets through the Par3PKM algorithm (as depicted in Section 4) with K = 100, which is the number of desired clusters. Finally, based on the ArcGIS platform with the road network of Beijing, we plot the results in Figure 8.

As illustrated in Figure 8, large-scale taxi trajectories are clustered into different areas, and each area with a different color represents a cluster. Each cluster (e.g., Area A) has obvious characteristics of traffic conditions; for example, the flow of people and automobiles is high in these areas in comparison with the real traffic map and traffic conditions of Beijing.

Figure 9: Process of the boundary identifying method: (a) division of the coordinate system; (b) selection of border points; and (c) connection of border points.

5.3. Boundary Identifying. On the ArcGIS platform, it is difficult to identify the borders of each cluster. To accurately form traffic subareas, we therefore connect the borders of each cluster via our boundary identifying method, which is described in Figure 9 and is mainly composed of the following three steps.

Step 1. As shown in Figure 9(a), we build a coordinate system, taking (0, 0) as the origin, and divide it equally into n angular parts.

Step 2. We match each cluster center to the origin of coordinates and then map the other points of the same cluster into the coordinate system. Finally, the farthest point of each part is selected (e.g., P in Figure 9(b)).

Step 3. As depicted in Figure 9(c), we connect these selected points of each part and then obtain a subarea.
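The three steps above can be sketched as follows; this is a minimal illustration in which the sector count n and the sample points are our own assumptions:

```python
import math

def subarea_boundary(center, points, n=8):
    # Steps 1-2: split the plane around the cluster centre into n equal angular
    # parts and keep the farthest point of each part.
    farthest = {}
    for p in points:
        dx, dy = p[0] - center[0], p[1] - center[1]
        sector = int((math.atan2(dy, dx) % (2 * math.pi)) / (2 * math.pi / n))
        dist = math.hypot(dx, dy)
        if sector not in farthest or dist > farthest[sector][0]:
            farthest[sector] = (dist, p)
    # Step 3: connecting the selected points in angular order yields the subarea.
    return [farthest[s][1] for s in sorted(farthest)]

pts = [(1, 0.1), (-0.1, 1), (-1, -0.1), (0.1, -1), (0.2, 0.1)]
print(subarea_boundary((0, 0), pts, n=4))  # the inner point (0.2, 0.1) is dropped
```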

Using the presented boundary identifying method with the clustering results of Par3PKM, we plot the division results of traffic subareas in Figure 10. As described in Figure 10, Area A is a typical traffic subarea, shown in the lower right corner of the graph, which includes Tsinghua University, Peking University, Renmin University of China, Beihang University, and Zhongguancun. Area B is composed of the Beijing Workers' Gymnasium, Blue Zoo Beijing, Children Outdoor Paopao Paradise, and so forth. Area C consists of Beijing North Railway Station, the Beijing Exhibition Center, the Capital Indoor Stadium, and so forth. Area D contains the Olympic Sports Center Stadium, Beijing Olympic Park, the National Indoor Stadium, the Beijing National Aquatics Center, the Beijing International Convention Center, and so forth.

5.4. Analysis of Results. According to the division results, we can observe that some areas with similar traffic conditions are divided into the same traffic subarea, such as Tsinghua University and Peking University, or the Olympic Sports Center Stadium and the National Indoor Stadium. In contrast, the Blue Zoo Beijing and the Beijing Exhibition Center, as well as Beijing North Railway Station and the Beijing International Convention Center, are classified into different traffic subareas. This is because the different regions of a single traffic subarea have great similarities and correlations in traffic conditions, business patterns, and other aspects.

Based on the above analysis, we conclude that the proposed Par3PKM algorithm can efficiently cluster big trajectory data on a Hadoop cluster using MapReduce. Moreover, our boundary identifying method can accurately connect the borders of the clustering results for each cluster. In particular, the division results are consistent with the real traffic conditions of the corresponding areas in Beijing. Overall, the results demonstrate that Par3PKM combined with DTSAD is a promising alternative for traffic subarea division with large-scale taxi trajectories and thus can reduce the complexity of traffic planning, management, and analysis. More importantly, it can provide helpful decision-making support for building ITSs.

6. Evaluation and Discussion

In this section, we evaluate the accuracy, efficiency, speedup, scale-up, and reliability of the Par3PKM algorithm via extensive experiments on real and synthetic data and then discuss the performance results.

6.1. Evaluation Setup. The experimental platform is based on a Hadoop cluster composed of one master machine and eight slave machines, each with an Intel Xeon E7-4820 2.00 GHz CPU (4-core) and 8.00 GB RAM. All the experiments are performed on Ubuntu 12.04 OS with Hadoop 1.0.4 and JDK 1.6.0.


Figure 10: Division results of traffic subarea.

Table 1: Data sets of the experimental evaluations.

Name                  Number of instances   Number of attributes
Iris                  150                   4
Haberman's Survival   306                   3
Ecoli                 336                   8
Hayes-Roth            160                   5
Lenses                24                    4
Wine                  178                   13

In addition to the real taxi trajectory data set (as described in Section 5.1), we use six synthetic data sets (as shown in Table 1) selected from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets.html) to evaluate the performance of the Par3PKM algorithm in comparison with K-Means, Par2PK-Means, and ParCLARA, which is the best-known K-medoids algorithm, CLARA (Clustering Large Applications) [50], with a MapReduce implementation. Meanwhile, each data set is processed into 80 MB, 160 MB, 320 MB, 640 MB, 1280 MB, and 2560 MB versions so as to further verify the efficiency of the proposed algorithm. Also, we process the seven data sets into 160 MB, 320 MB, 480 MB, 640 MB, 800 MB, 960 MB, 1120 MB, and 1280 MB versions for validating the scale-up of our algorithm.

6.2. Evaluation on Efficiency. We perform efficiency experiments where we execute Par3PKM, Par2PK-Means, and ParCLARA in the parallel environment with eight nodes, and K-Means in the single-machine environment, using seven data sets with different sizes (varying from 80 MB to 2560 MB), in order to demonstrate whether the Par3PKM algorithm can process larger data sets with higher efficiency. The experimental results are shown in Table 2 and Figure 11, respectively.

As depicted in Figure 11, the K-Means algorithm cannot process data sets over 1280 MB in the single-machine environment on account of memory overflow, and thus the graph does not present the corresponding execution time of K-Means in this experiment. However, the Par3PKM algorithm can effectively handle data sets of more than 1280 MB, and even larger data, in the parallel environment (i.e., in a "Big Data" environment). In particular, the Par3PKM algorithm has higher efficiency than the ParCLARA and Par2PK-Means algorithms owing to the improvement and parallelization of the K-Means algorithm, such as the addition of the Combiner function in the Combine phase. To reduce the computational cost of a MapReduce job and save the limited bandwidth available on a Hadoop cluster, the Combiner function of the Par3PKM algorithm is employed to cut down the amount of data shuffled between the Map tasks and the Reduce tasks; it runs on the Map output, and its output forms the input to the Reduce function.

At the same time, we can find that the execution time of the K-Means algorithm is shorter than that of the Par3PKM algorithm when clustering small-scale data sets. The reason is that the communication and interaction among the nodes consume a certain amount of time in the parallel environment (e.g., the start of Job and Task processes and the communication between the NameNode and the DataNodes), thereby making the execution time much longer than the actual computation time of the Par3PKM algorithm. More importantly, we can also observe that the efficiency advantage of the Par3PKM algorithm multiplies, and its superiority becomes more marked, as the size of the data sets gradually increases.


Table 2: Execution time comparison on seven data sets.

Data set              Size (MB)   Execution time (s)
                                  K-Means   ParCLARA   Par2PK-Means   Par3PKM
Taxi Trajectory       80          75        662        439            312
                      160         103       756        486            371
                      320         195       893        579            406
                      640         1125      1173       838            614
                      1280        2373      1679       1348           1026
                      2560        —         2721       2312           1879
Iris                  80          32        612        301            213
                      160         54        693        380            279
                      320         116       812        440            306
                      640         685       1045       630            403
                      1280        1800      1248       1005           768
                      2560        —         2463       2013           1420
Haberman's Survival   80          56        576        311            296
                      160         60        675        400            324
                      320         130       823        470            378
                      640         720       987        670            426
                      1280        2010      1321       1200           873
                      2560        —         2449       2200           1719
Ecoli                 80          38        628        330            283
                      160         66        712        400            324
                      320         130       835        460            375
                      640         756       1104       700            482
                      1280        1912      1636       1234           763
                      2560        —         2479       2312           1416
Hayes-Roth            80          41        568        310            278
                      160         57        643        395            347
                      320         125       726        460            387
                      640         715       973        660            438
                      1280        1980      1479       1211           736
                      2560        —         2423       2120           1567
Lenses                80          32        624        309            297
                      160         59        701        389            327
                      320         130       924        452            376
                      640         700       1072       545            432
                      1280        1895      1378       1085           814
                      2560        —         2379       2089           1473
Wine                  80          50        635        350            317
                      160         78        705        420            356
                      320         130       835        470            402
                      640         730       1006       610            426
                      1280        2100      1346       1245           843
                      2560        —         2463       2240           1645

6.3. Evaluation on Accuracy. To evaluate the accuracy, we cluster the different data sets via Par3PKM, Par2PK-Means, and ParCLARA on a Hadoop cluster with eight nodes, and via K-Means in the single-machine environment, respectively. We then plot the results in Figure 12(a).

The quality of the algorithm is evaluated via the following error rate (ER) equation [51]:

ER = (O_m / O_t) × 100, (8)

Figure 11: Efficiency comparison on different data sets: (a) Taxi Trajectory; (b) Iris; (c) Haberman's Survival; (d) Ecoli; (e) Hayes-Roth; (f) Lenses; and (g) Wine.

Figure 12: Accuracy and reliability: (a) accuracy comparison of Par3PKM, Par2PK-Means, and ParCLARA on different data sets; and (b) reliability of Par3PKM.

where O_m is the number of misclassified objects and O_t is the total number of objects. The lower the ER, the better the clustering.
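A worked instance of equation (8), with illustrative counts:

```python
def error_rate(o_m, o_t):
    # ER from equation (8): misclassified objects over total objects, in percent.
    return o_m / o_t * 100

print(round(error_rate(12, 150), 2))  # 8.0
```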

As illustrated in Figure 12(a), in comparison with the other algorithms, the Par3PKM algorithm produces more accurate clustering results in most cases. The results indicate that the Par3PKM algorithm is valid and feasible.

6.4. Evaluation on Speedup. In order to evaluate the speedup of the Par3PKM algorithm, we keep the seven data sets constant (1280 MB) and increase the number of nodes (ranging from 1 node to 8 nodes) on a Hadoop cluster, and then plot the results in Figure 13(a). Moreover, we utilize the Iris data set (1280 MB) to further verify the speedup of Par3PKM in comparison with Par2PK-Means and ParCLARA, and the results are illustrated in Figure 13(b).

The speedup metric [52, 53] is defined as

Speedup = T_s / T_p, (9)

where T_s represents the execution time of an algorithm on one node (i.e., the sequential execution time) for clustering


Figure 13: Speedup: (a) speedup comparison of Par3PKM on different data sets; and (b) speedup comparison of Par3PKM, Par2PK-Means, and ParCLARA.

objects using the given data sets, and T_p denotes the execution time of the algorithm for solving the same problem using the same data sets on a Hadoop cluster with p nodes (i.e., the parallel execution time).

As depicted in Figure 13(a), the speedup of the Par3PKM algorithm increases relatively linearly with an increasing number of nodes. It is known that linear speedup is difficult to achieve because of the communication cost and the skew of the slaves. Furthermore, Figure 13(b) shows that Par3PKM has better speedup than Par2PK-Means and ParCLARA. The results demonstrate that the parallel Par3PKM algorithm has very good speedup performance, which is almost the same for data sets with very different sizes.
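Equation (9) can be applied directly to measured execution times; the timings below are hypothetical and are not values from Table 2:

```python
def speedup(t_serial, t_parallel):
    # Equation (9): one-node execution time over p-node execution time.
    return t_serial / t_parallel

# Hypothetical timings (seconds) on a fixed 1280 MB data set, keyed by node count.
times = {1: 4000.0, 2: 2100.0, 4: 1150.0, 8: 640.0}
print({p: round(speedup(times[1], t), 2) for p, t in times.items()})
```

Sub-linear values (e.g., well below 8 on 8 nodes) reflect the communication and skew costs mentioned above.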

6.5. Evaluation on Scale-Up. To evaluate how well the Par3PKM algorithm processes larger data sets when more nodes are available, we perform scale-up experiments where we increase the size of the data sets (varying from 160 MB to 1280 MB) in direct proportion to the number of nodes (ranging from 1 node to 8 nodes), and then plot the results in Figure 14(a). Furthermore, with the Iris data sets (varying from 160 MB to 1280 MB), the scale-up comparison of Par3PKM, Par2PK-Means, and ParCLARA is depicted in Figure 14(b).

The scale-up metric [52, 53] is given by

Scale-up = T_s / T_p, (10)

where T_s is the execution time of an algorithm for processing the given data sets on one node, and T_p is the execution time of the algorithm for handling p-times larger data sets on p-times more nodes.
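Similarly, equation (10) compares the base run with a proportionally scaled run; the timings below are hypothetical:

```python
def scale_up(t_base, t_scaled):
    # Equation (10): time for the base data set on one node over the time for
    # p-times larger data sets on p-times more nodes (ideal value: 1).
    return t_base / t_scaled

# Hypothetical: 160 MB on 1 node versus 1280 MB on 8 nodes.
print(round(scale_up(371.0, 405.0), 3))
```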

As illustrated in Figure 14(a), the scale-up values of Par3PKM stay in the vicinity of 1, or even less, with the proportional growth of both the number of nodes and the size of the data sets. Moreover, Figure 14(b) shows that Par3PKM has better scale-up than Par2PK-Means and ParCLARA. The results indicate that the Par3PKM algorithm has excellent scale-up and adaptability on large-scale data sets based on a Hadoop cluster with MapReduce.

6.6. Evaluation on Reliability. To evaluate the reliability of the Par3PKM algorithm, we shut down several nodes (ranging from 1 node to 7 nodes) to demonstrate whether Par3PKM can still execute normally and achieve the same clustering results on the Iris data set (1280 MB), and then plot the results in Figure 12(b).

As illustrated in Figure 12(b), although the execution time of the Par3PKM algorithm increases gradually with the growing number of faulty nodes, the Par3PKM algorithm still executes normally and produces the same results. The results show that the Par3PKM algorithm has good reliability in a "Big Data" environment, owing to the high fault tolerance of the MapReduce framework on a Hadoop platform: when a node cannot execute its tasks on a Hadoop cluster, the JobTracker automatically reassigns the tasks of the faulty node(s) to other spare nodes. Conversely, the serial K-Means algorithm on a single machine cannot execute normally when the machine is faulty, in which case the entire computational task fails.

In summary, extensive experiments are conducted on real and synthetic data, and the performance results demonstrate that the proposed Par3PKM algorithm is much more efficient and accurate, with better speedup, scale-up, and reliability.


Figure 14: Scale-up: (a) scale-up comparison of Par3PKM on different data sets; and (b) scale-up comparison of Par3PKM, Par2PK-Means, and ParCLARA.

7. Conclusions

In this paper, we have proposed an efficient MapReduce-based parallel clustering algorithm, named Par3PKM, to solve the traffic subarea division problem with large-scale taxi trajectories. In Par3PKM, the distance metric and initialization strategy of K-Means are optimized in order to enhance the accuracy of clustering. Then, to improve the efficiency and scalability of Par3PKM, the optimized K-Means algorithm is implemented in a MapReduce framework on Hadoop. The optimization and parallelization of Par3PKM save memory consumption and reduce the computational cost of big calculations, thereby significantly improving the accuracy, efficiency, scalability, and reliability of traffic subarea division. Our performance evaluation indicates that the proposed algorithm can efficiently cluster a large number of GPS trajectories of taxicabs and, in particular, achieves more accurate results than K-Means, Par2PK-Means, and ParCLARA, with favorably superior performance. Furthermore, based on Par3PKM, we have presented a distributed traffic subarea division method, named DTSAD, which is performed on a Hadoop distributed computing platform with the MapReduce parallel processing paradigm. In DTSAD, the boundary identifying method can effectively connect the borders of the clustering results. Most importantly, we have divided the traffic subareas of Beijing using big real-world taxi trajectory data sets through the presented method, and our case study demonstrates that our approach can accurately and efficiently divide traffic subareas.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Authors' Contribution

Dawen Xia and Binfeng Wang contributed equally to this work.

Acknowledgments

The authors would like to thank the academic editor and the anonymous reviewers for their valuable comments and suggestions. This work was partially supported by the National Natural Science Foundation of China (Grant no. 61402380), the Scientific Project of the State Ethnic Affairs Commission of the People's Republic of China (Grant no. 14GZZ012), the Science and Technology Foundation of Guizhou (Grant no. LH20147386), and the Fundamental Research Funds for the Central Universities (Grants nos. XDJK2015B030 and XDJK2015D029).

References

[1] Y. Qi and S. Ishak, "Stochastic approach for short-term freeway traffic prediction during peak periods," IEEE Transactions on Intelligent Transportation Systems, vol. 14, no. 2, pp. 660–672, 2013.

[2] A. de Palma and R. Lindsey, "Traffic congestion pricing methodologies and technologies," Transportation Research Part C: Emerging Technologies, vol. 19, no. 6, pp. 1377–1399, 2011.

[3] V. Marx, "The big challenges of big data," Nature, vol. 498, no. 7453, pp. 255–260, 2013.

[4] D. Agrawal, P. Bernstein, E. Bertino et al., "Challenges and opportunities with big data: a community white paper developed by leading researchers across the United States," White Paper, 2012.

[5] "Special online collection: dealing with data," Science, vol. 331, no. 6018, pp. 639–806, 2011.

[6] "Big data: science in the petabyte era," Nature, vol. 455, no. 7209, pp. 1–136, 2008.

[7] R. R. Weiss and L. Zgorski, Obama Administration Unveils 'Big Data' Initiative: Announces $200 Million in New R&D Investments, Office of Science and Technology Policy, Executive Office of the President, 2012.

[8] J. Manyika, M. Chui, B. Brown et al., "Big data: the next frontier for innovation, competition, and productivity," Tech. Rep., McKinsey Global Institute, 2011.

[9] Y. Genovese and S. Prentice, "Pattern-based strategy: getting value from big data," Gartner Special Report, 2011.

[10] N. J. Yuan, Y. Zheng, L. Zhang, and X. Xie, "T-finder: a recommender system for finding passengers and vacant taxis," IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 10, pp. 2390–2403, 2013.

[11] J. Yuan, Y. Zheng, C. Zhang et al., "T-drive: driving directions based on taxi trajectories," in Proceedings of the 18th International Conference on Advances in Geographic Information Systems (ACM SIGSPATIAL GIS '10), pp. 99–108, ACM, San Jose, Calif, USA, November 2010.

[12] J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, Waltham, Mass, USA, 3rd edition, 2011.

[13] D. Pelleg and A. Moore, "X-means: extending K-means with efficient estimation of the number of clusters," in Proceedings of the 17th International Conference on Machine Learning (ICML '00), pp. 727–734, Stanford, Calif, USA, June 2000.

[14] S. Kantabutra and A. L. Couch, "Parallel K-means clustering algorithm on NOWs," NECTEC Technical Journal, vol. 1, no. 6, pp. 243–247, 2000.

[15] Y. Zhang, Z. Xiong, J. Mao, and L. Ou, "The study of parallel K-means algorithm," in Proceedings of the 6th World Congress on Intelligent Control and Automation (WCICA '06), pp. 5868–5871, IEEE, Dalian, China, June 2006.

[16] P. Kraj, A. Sharma, N. Garge, R. Podolsky, and R. A. McIndoe, "ParaKMeans: implementation of a parallelized K-means algorithm suitable for general laboratory use," BMC Bioinformatics, vol. 9, article 200, 13 pages, 2008.

[17] M. K. Pakhira, "Clustering large databases in distributed environment," in Proceedings of the IEEE International Advance Computing Conference (IACC '09), pp. 351–358, Patiala, India, March 2009.

[18] K. J. Kohlhoff, V. S. Pande, and R. B. Altman, "K-means for parallel architectures using all-prefix-sum sorting and updating steps," IEEE Transactions on Parallel and Distributed Systems, vol. 24, no. 8, pp. 1602–1612, 2013.

[19] C.-T. Chu, S. K. Kim, Y.-A. Lin et al., "Map-reduce for machine learning on multicore," in Proceedings of the 20th Annual Conference on Neural Information Processing Systems (NIPS '06), pp. 281–288, Vancouver, Canada, 2006.

[20] W. Zhao, H. Ma, and Q. He, "Parallel K-means clustering based on MapReduce," in Proceedings of the 1st International Conference on Cloud Computing (CloudCom '09), pp. 674–679, Beijing, China, December 2009.

[21] P. Zhou, J. Lei, and W. Ye, "Large-scale data sets clustering based on MapReduce and Hadoop," Journal of Computational Information Systems, vol. 7, no. 16, pp. 5956–5963, 2011.

[22] C. D. Nguyen, D. T. Nguyen, and V.-H. Pham, "Parallel two-phase K-means," in Proceedings of the 13th International Conference on Computational Science and Its Applications (ICCSA '13), pp. 24–27, Ho Chi Minh City, Vietnam, June 2013.

[23] R. J. Walinchus, "Real-time network decomposition and subnetwork interfacing," Tech. Rep. HS-011 999, Highway Research Record, 1971.

[24] S. C. Wong, W. T. Wong, C. M. Leung, and C. O. Tong, "Group-based optimization of a time-dependent TRANSYT traffic model for area traffic control," Transportation Research Part B: Methodological, vol. 36, no. 4, pp. 291–312, 2002.

[25] D. I. Robertson and R. D. Bretherton, "Optimizing networks of traffic signals in real time: the SCOOT method," IEEE Transactions on Vehicular Technology, vol. 40, no. 1, pp. 11–15, 1991.

[26] Y.-Y. Ma and X.-G. Yang, "Traffic sub-area division expert system for urban traffic control," in Proceedings of the International Conference on Intelligent Computation Technology and Automation (ICICTA '08), pp. 589–593, Hunan, China, October 2008.

[27] K. Lu, J.-M. Xu, S.-J. Zheng, and S.-M. Wang, "Research on fast dynamic division method of coordinated control subarea," Acta Automatica Sinica, vol. 38, no. 2, pp. 279–287, 2012.

[28] K. Lu, J.-M. Xu, and S.-J. Zheng, "Correlation degree analysis of neighboring intersections and its application," Journal of South China University of Technology, vol. 37, no. 11, pp. 37–42, 2009.

[29] H. Guo, J. Cheng, Q. Peng, C. Zhu, and Y. Mu, "Dynamic division of traffic control sub-area methods based on the similarity of adjacent intersections," in Proceedings of the IEEE 17th International Conference on Intelligent Transportation Systems (ITSC '14), pp. 2208–2213, Qingdao, China, October 2014.

[30] C. Li, Y. Xie, H. Zhang, and X.-L. Yan, "Dynamic division about traffic control sub-area based on back propagation neural network," in Proceedings of the 2nd International Conference on Intelligent Human-Machine Systems and Cybernetics (IHMSC '10), pp. 22–25, Nanjing, China, August 2010.

[31] Z. Zhou, S. Lin, and Y. Xi, "A fast network partition method for large-scale urban traffic networks," Journal of Control Theory and Applications, vol. 11, no. 3, pp. 359–366, 2013.

[32] J. MacQueen, "Some methods for classification and analysis of multivariate observations," in Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, pp. 281–297, University of California Press, Berkeley, Calif, USA, 1967.

[33] A. K. Jain, "Data clustering: 50 years beyond K-means," Pattern Recognition Letters, vol. 31, no. 8, pp. 651–666, 2010.

[34] X. Wu, V. Kumar, J. R. Quinlan et al., "Top 10 algorithms in data mining," Knowledge and Information Systems, vol. 14, no. 1, pp. 1–37, 2008.

[35] W. Zou, Y. Zhu, H. Chen, and X. Sui, "A clustering approach using cooperative artificial bee colony algorithm," Discrete Dynamics in Nature and Society, vol. 2010, Article ID 459796, 16 pages, 2010.

[36] D. T. Pham, S. S. Dimov, and C. D. Nguyen, "An incremental K-means algorithm," Proceedings of the Institution of Mechanical Engineers, Part C: Journal of Mechanical Engineering Science, vol. 218, no. 7, pp. 783–794, 2004.

[37] D. T. Pham, S. S. Dimov, and C. D. Nguyen, "A two-phase K-means algorithm for large datasets," Journal of Mechanical Engineering Science, vol. 218, no. 10, pp. 1269–1273, 2004.

[38] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, "The Hadoop distributed file system," in Proceedings of the 26th Symposium on Mass Storage Systems and Technologies (MSST '10), pp. 1–10, IEEE, Incline Village, Nev, USA, May 2010.

[39] A. Ene, S. Im, and B. Moseley, "Fast clustering using MapReduce," in Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '11), pp. 681–689, San Diego, Calif, USA, August 2011.

[40] T. White, Hadoop: The Definitive Guide, O'Reilly Media, Sebastopol, Calif, USA, 3rd edition, 2012.

[41] D. Xia, Z. Rong, Y. Zhou, Y. Li, Y. Shen, and Z. Zhang, "A novel parallel algorithm for frequent itemsets mining in massive small files datasets," ICIC Express Letters, Part B: Applications, vol. 5, no. 2, pp. 459–466, 2014.

[42] D. Xia, Z. Rong, Y. Zhou, B. Wang, Y. Li, and Z. Zhang, "Discovery and analysis of usage data based on Hadoop for personalized information access," in Proceedings of the IEEE 16th International Conference on Computational Science and Engineering—Big Data Science and Engineering (CSE-BDSE '13), pp. 917–924, IEEE, Sydney, Australia, December 2013.

[43] D. Xia, B. Wang, Z. Rong, Y. Li, and Z. Zhang, "Effective methods and strategies for massive small files processing based on Hadoop," ICIC Express Letters, vol. 8, no. 7, pp. 1935–1941, 2014.

[44] S. Ghemawat, H. Gobioff, and S.-T. Leung, "The Google file system," in Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP '03), pp. 29–43, Bolton Landing, NY, USA, October 2003.

[45] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.

[46] P. Zikopoulos, C. Eaton, D. deRoos, T. Deutsch, and G. Lapis, Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data, McGraw-Hill, New York, NY, USA, 2011.

[47] W. K. D. Pun and A. B. M. S. Ali, "Unique distance measure approach for K-means (UDMA-Km) clustering algorithm," in Proceedings of the IEEE Region 10 Conference (TENCON '07), pp. 1–4, IEEE, Taipei, Taiwan, November 2007.

[48] A. M. Fahim, A. M. Salem, F. A. Torkey, and M. A. Ramadan, "An efficient enhanced K-means clustering algorithm," Journal of Zhejiang University SCIENCE A, vol. 7, no. 10, pp. 1626–1633, 2006.

[49] M. Zhu, Data Mining, University of Science and Technology of China Press, 2002.

[50] L. Kaufman and P. J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, John Wiley & Sons, 1990.

[51] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proceedings of the 26th Annual Conference on Neural Information Processing Systems (NIPS '12), pp. 1097–1105, Lake Tahoe, Nev, USA, December 2012.

[52] S. Englert, J. Gray, T. Kocher, and P. Shah, "A benchmark of NonStop SQL release 2 demonstrating near-linear speedup and scaleup on large databases," ACM SIGMETRICS Performance Evaluation Review, vol. 18, no. 1, pp. 245–246, 1990.

[53] X. Xu, J. Jager, and H.-P. Kriegel, "A fast parallel clustering algorithm for large spatial databases," Data Mining and Knowledge Discovery, vol. 3, no. 3, pp. 263–290, 1999.


Input:
  key: the offset
  value: the sample
  centroids: the global variable
Output: ⟨key1, value1⟩, where
  key1: the index of the closest centroid
  value1: the information of the sample

(1) Construct a global variable centroids including the information of the closest centroid point
(2) Construct the sample examples to extract the data objects from value
(3) min_Dis = Double.MAX_VALUE
(4) index = −1
(5) for i = 0 to centroids.length do
(6)     distance = DistanceFunction(examples, centroids[i])
(7)     if distance < min_Dis then
(8)         min_Dis = distance
(9)         index = i
(10)    end if
(11) end for
(12) Take index as key1
(13) Construct value1 as a string consisting of the values of different dimensions
(14) return ⟨key1, value1⟩ pairs

Algorithm 1: Map(key, value).
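To make the Map logic concrete, here is a minimal Python sketch of Algorithm 1 outside Hadoop. The squared Euclidean distance used for DistanceFunction is an assumption of this sketch; Par3PKM itself employs its own optimized distance metric.

```python
def map_step(value, centroids):
    """Emit (index of the closest centroid, sample) for one input record."""
    def distance(a, b):
        # squared Euclidean distance -- an assumed stand-in for DistanceFunction
        return sum((x - y) ** 2 for x, y in zip(a, b))

    min_dis, index = float("inf"), -1
    for i, c in enumerate(centroids):
        d = distance(value, c)
        if d < min_dis:
            min_dis, index = d, i
    return index, value

# The point (0.9, 1.1) lies closest to the second of two centroids:
print(map_step((0.9, 1.1), [(0.0, 0.0), (1.0, 1.0)]))  # -> (1, (0.9, 1.1))
```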

Input:
  key1: the index of the cluster
  medi: the list of the samples assigned to the same cluster
Output: ⟨key2, value2⟩, where
  key2: the index of the cluster
  value2: the sum of the values of the samples belonging to the same cluster and the number of samples

(1) Construct a counter num_s to record the number of samples in the same cluster
(2) Construct an array sum_v to record the sum of the values of different dimensions of the samples belonging to the same cluster (i.e., the samples in the list medi)
(3) Construct the sample examples to extract the data objects from medi.next() and the dimensions to obtain the dimension of the original data object
(4) num_s = 0
(5) while (medi.hasNext()) do
(6)     CurrentPoint = medi.next()
(7)     num_s++
(8)     for i = 0 to dimensions do
(9)         sum_v[i] += CurrentPoint.point[i]
(10)        Calculate the sum of the values of each dimension of examples
(11)    end for
(12)    for i = 0 to dimensions do
(13)        mean[i] = sum_v[i]/num_s
(14)        Compute the mean value of the samples for each cluster
(15)    end for
(16) end while
(17) Take index as key2
(18) Construct value2 as a string containing the sum of the values of each dimension sum_v[i] and the number of samples num_s
(19) return ⟨key2, value2⟩ pairs

Algorithm 2: Combiner(key1, medi).


Input:
  key2: the index of the cluster
  medi: the list of the local sums from different clusters
Output: ⟨key3, value3⟩, where
  key3: the index of the cluster
  value3: the new cluster center

(1) Construct a counter Num to record the total number of samples belonging to the same cluster
(2) Construct an array sum_v to record the sum of the values of different dimensions of the samples in the same cluster (i.e., the samples in the list medi)
(3) Construct the sample examples to extract the data objects from medi.next() and the dimensions to obtain the dimension of the original data object
(4) Num = 0
(5) while (medi.hasNext()) do
(6)     CurrentPoint = medi.next()
(7)     Num += num_s
(8)     for i = 0 to dimensions do
(9)         sum_v[i] += CurrentPoint.point[i]
(10)    end for
(11)    for i = 0 to dimensions do
(12)        mean[i] = sum_v[i]/Num
(13)        Obtain the new cluster center
(14)    end for
(15) end while
(16) Take index as key3
(17) Construct value3 as a string composed of the new cluster center
(18) return ⟨key3, value3⟩ pairs

Algorithm 3: Reduce(key2, medi).
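The Combine and Reduce phases (Algorithms 2 and 3) amount to summing per-cluster vectors and counts locally, then merging the partial sums and dividing to obtain the new centers. A minimal Python sketch of one such aggregation round, with plain functions standing in for the Hadoop Combiner/Reducer API:

```python
from collections import defaultdict

def combiner(pairs):
    """Locally aggregate map output: per cluster index, (sum vector, sample count)."""
    partial = {}
    for idx, point in pairs:
        s, n = partial.get(idx, ([0.0] * len(point), 0))
        partial[idx] = ([a + b for a, b in zip(s, point)], n + 1)
    return partial

def reducer(partials):
    """Merge per-node partial sums and emit the new cluster centers."""
    total = {}
    for part in partials:
        for idx, (s, n) in part.items():
            ts, tn = total.get(idx, ([0.0] * len(s), 0))
            total[idx] = ([a + b for a, b in zip(ts, s)], tn + n)
    return {idx: [v / n for v in s] for idx, (s, n) in total.items()}

# Two map splits produce (cluster index, point) pairs:
split1 = [(0, (0.0, 0.0)), (0, (2.0, 2.0))]
split2 = [(0, (4.0, 4.0)), (1, (10.0, 10.0))]
print(reducer([combiner(split1), combiner(split2)]))
# -> {0: [2.0, 2.0], 1: [10.0, 10.0]}
```

The combiner is what keeps shuffle traffic small: each node forwards one (sum, count) pair per cluster instead of every sample, matching the bandwidth argument made in Section 6.2.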

Figure 8: Clustering results of large-scale taxi trajectories.

5.2. Parallel Clustering. After data preprocessing with the DTSAD method, we extract the relevant attributes (e.g., longitude, latitude) of the GPS trajectory records where the passengers pick up or drop off taxis from the aforementioned data sets. Then, on a Hadoop cluster with MapReduce, we cluster the extracted trajectory data sets through the Par3PKM algorithm (as depicted in Section 4) with K = 100, which is the number of desired clusters. Finally, based on the ArcGIS platform with the road network of Beijing, we plot the results in Figure 8.

As illustrated in Figure 8, large-scale taxi trajectories are clustered in different areas, respectively, and each area with different colors represents a cluster. Each cluster (e.g., Area A) has obvious characteristics of traffic condition, such as the flow of people and automobiles, which is high in these areas in comparison with the real traffic map and traffic condition of Beijing.

Figure 9: Process of the boundary identifying method: (a) division of the coordinate system, (b) selection of border points, and (c) connection of border points.

5.3. Boundary Identifying. On the ArcGIS platform, we have difficulty in identifying the borders of each cluster. However, we have to connect the borders of each cluster in order to accurately form traffic subareas via our boundary identifying method, which is described in Figure 9. As illustrated in Figure 9, the boundary identifying method is mainly composed of the following three steps.

Step 1. As shown in Figure 9(a), we build a coordinate system which is equally divided into n parts, taking (0, 0) as the origin of coordinates.

Step 2. We match each cluster center to the origin of coordinates and then map the other points in the same cluster to the coordinate system. Finally, the farthest point of each part is selected (e.g., P in Figure 9(b)).

Step 3. As depicted in Figure 9(c), we connect these selected points of each part and then obtain a subarea.
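A minimal Python sketch of these three steps, under our own assumptions rather than the authors' implementation: each point is expressed relative to its cluster center, the angular range is split into n equal sectors, and the farthest point per sector becomes a border point.

```python
import math

def boundary(points, center, n=8):
    """For each of n angular sectors around the cluster center, pick the
    farthest point; connecting these points traces the subarea border."""
    farthest = [None] * n
    best = [-1.0] * n
    cx, cy = center
    for x, y in points:
        dx, dy = x - cx, y - cy
        # angle in [0, 2*pi) mapped to a sector index in [0, n)
        sector = int((math.atan2(dy, dx) % (2 * math.pi)) / (2 * math.pi / n))
        sector = min(sector, n - 1)  # guard against float rounding at 2*pi
        d = math.hypot(dx, dy)
        if d > best[sector]:
            best[sector], farthest[sector] = d, (x, y)
    return [p for p in farthest if p is not None]

pts = [(1, 1), (2, 1), (-1, 2), (-2, -1), (1, -2)]
print(boundary(pts, (0, 0), n=4))  # -> [(2, 1), (-1, 2), (-2, -1), (1, -2)]
```

Note how (1, 1) is dropped: it falls in the same sector as (2, 1) but lies closer to the center, so only the outermost point of that sector survives.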

Using the presented boundary identifying method with the clustering results of Par3PKM, we plot the division results of traffic subareas in Figure 10. As described in Figure 10, Area A is a typical traffic subarea, shown in the lower right corner of the graph, which includes Tsinghua University, Peking University, Renmin University of China, Beihang University, and Zhongguancun. Area B is composed of the Beijing Workers' Gymnasium, Blue Zoo Beijing, Children Outdoor Paopao Paradise, and so forth. Area C consists of Beijing North Railway Station, Beijing Exhibition Center, Capital Indoor Stadium, and so forth. Area D contains Olympic Sports Center Stadium, Beijing Olympic Park, National Indoor Stadium, Beijing National Aquatics Center, Beijing International Convention Center, and so forth.

5.4. Analysis of Results. According to the division results, we can observe that some areas with similar traffic conditions are divided into the same traffic subarea, such as the Tsinghua University and the Peking University, and the Olympic Sports Center Stadium and the National Indoor Stadium. In contrast, the Blue Zoo Beijing and the Beijing Exhibition Center, and the Beijing North Railway Station and the Beijing International Convention Center, are classified into different traffic subareas. That is because the different regions of a traffic subarea have great similarities and correlations in traffic condition, business pattern, and other aspects.

Based on the above analysis, we conclude that the proposed Par3PKM algorithm can efficiently cluster big trajectory data on a Hadoop cluster using MapReduce. Moreover, our boundary identifying method can accurately connect the borders of clustering results for each cluster. In particular, the division results are consistent with the real traffic condition of the corresponding areas in Beijing. Overall, the results demonstrate that Par3PKM combined with DTSAD is a promising alternative for traffic subarea division with large-scale taxi trajectories and thus can reduce the complexity of traffic planning, management, and analysis. More importantly, it can provide helpful decision-making for building ITSs.

6. Evaluation and Discussion

In this section, we evaluate the accuracy, efficiency, speedup, scale-up, and reliability of the Par3PKM algorithm via extensive experiments on real and synthetic data and then discuss the performance results.

6.1. Evaluation Setup. The experimental platform is based on a Hadoop cluster, which is composed of one master machine and eight slave machines with Intel Xeon E7-4820 2.00 GHz CPU (4-core) and 8.00 GB RAM. All the experiments are performed on Ubuntu 12.04 OS with Hadoop 1.0.4 and JDK 1.6.0.


Figure 10: Division results of traffic subareas.

Table 1: Data sets of the experimental evaluations.

Name                 Number of instances   Number of attributes
Iris                 150                   4
Haberman's Survival  306                   3
Ecoli                336                   8
Hayes-Roth           160                   5
Lenses               24                    4
Wine                 178                   13

In addition to a real taxi trajectory data set (as described in Section 5.1), we use six synthetic data sets (as shown in Table 1) selected from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets.html) to evaluate the performance of the Par3PKM algorithm in comparison with K-Means, Par2PK-Means, and ParCLARA, which is the best-known K-medoid algorithm CLARA (Clustering Large Applications) [50] with a MapReduce implementation. Meanwhile, each data set is processed into 80 MB, 160 MB, 320 MB, 640 MB, 1280 MB, and 2560 MB versions so as to further verify the efficiency of the proposed algorithm. Also, we process the seven data sets into 160 MB, 320 MB, 480 MB, 640 MB, 800 MB, 960 MB, 1120 MB, and 1280 MB versions for validating the scale-up of our algorithm.

6.2. Evaluation on Efficiency. We perform efficiency experiments where we execute Par3PKM, Par2PK-Means, and ParCLARA in the parallel environment with eight nodes, and K-Means in the single machine environment, using seven data sets with different sizes (varying from 80 MB to 2560 MB), respectively, in order to demonstrate whether the Par3PKM algorithm can process larger data sets and has higher efficiency. The experimental results are respectively shown in Table 2 and Figure 11.

As depicted in Figure 11, the K-Means algorithm cannot process over 1280 MB data sets in the single machine environment on account of memory overflow, and thus the graph does not present the corresponding execution time of K-Means in this experiment. However, the Par3PKM algorithm can effectively handle more than 1280 MB data sets, and even larger data, in the parallel environment (i.e., in a "Big Data" environment). In particular, the Par3PKM algorithm has higher efficiency than the ParCLARA algorithm and the Par2PK-Means algorithm, owing to the improvement and parallelism of the K-Means algorithm, such as the addition of the Combiner function in the Combine phase. To reduce the computational complexity of a MapReduce job and save the limited bandwidth available on a Hadoop cluster, the Combiner function of the Par3PKM algorithm is employed to cut down the amount of data shuffled between the Map tasks and the Reduce tasks; it is specified to run on the Map output, and its output forms the input to the Reduce function.

At the same time, we can find that the execution time of the K-Means algorithm is shorter than that of the Par3PKM algorithm in clustering small-scale data sets. The reason is that the communication and interaction of each node consume a certain amount of time in the parallel environment (e.g., the start of Job and Task tasks, the communication between the NameNode and DataNodes), thereby leading to the execution time being much longer than the actual computation time of the Par3PKM algorithm. More importantly, we can also observe that the efficiency of the Par3PKM algorithm improves multiply, and its superiority is more marked, with the gradually increasing sizes of data sets.


Table 2: Execution time comparison on seven data sets.

                                  Execution time (s)
Data set              Size (MB)   K-Means   ParCLARA   Par2PK-Means   Par3PKM
Taxi Trajectory       80          75        662        439            312
                      160         103       756        486            371
                      320         195       893        579            406
                      640         1125      1173       838            614
                      1280        2373      1679       1348           1026
                      2560        —         2721       2312           1879
Iris                  80          32        612        301            213
                      160         54        693        380            279
                      320         116       812        440            306
                      640         685       1045       630            403
                      1280        1800      1248       1005           768
                      2560        —         2463       2013           1420
Haberman's Survival   80          56        576        311            296
                      160         60        675        400            324
                      320         130       823        470            378
                      640         720       987        670            426
                      1280        2010      1321       1200           873
                      2560        —         2449       2200           1719
Ecoli                 80          38        628        330            283
                      160         66        712        400            324
                      320         130       835        460            375
                      640         756       1104       700            482
                      1280        1912      1636       1234           763
                      2560        —         2479       2312           1416
Hayes-Roth            80          41        568        310            278
                      160         57        643        395            347
                      320         125       726        460            387
                      640         715       973        660            438
                      1280        1980      1479       1211           736
                      2560        —         2423       2120           1567
Lenses                80          32        624        309            297
                      160         59        701        389            327
                      320         130       924        452            376
                      640         700       1072       545            432
                      1280        1895      1378       1085           814
                      2560        —         2379       2089           1473
Wine                  80          50        635        350            317
                      160         78        705        420            356
                      320         130       835        470            402
                      640         730       1006       610            426
                      1280        2100      1346       1245           843
                      2560        —         2463       2240           1645

6.3. Evaluation on Accuracy. To evaluate the accuracy, we cluster different data sets via Par3PKM, Par2PK-Means, and ParCLARA based on a Hadoop cluster with eight nodes, and through K-Means in the single machine environment, respectively. We then plot the results in Figure 12(a).

The quality of the algorithm is evaluated via the following error rate (ER) [51] equation:

ER = (O_m / O_t) × 100%, (8)

Figure 11: Efficiency comparison on different data sets: (a) Taxi Trajectory, (b) Iris, (c) Haberman's Survival, (d) Ecoli, (e) Hayes-Roth, (f) Lenses, and (g) Wine.

Figure 12: Accuracy and reliability: (a) accuracy comparison of Par3PKM, Par2PK-Means, and ParCLARA on different data sets and (b) reliability of Par3PKM.

where O_m is the number of misclassified objects and O_t is the total number of objects. The lower the ER, the better the clustering.
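As a quick arithmetic illustration of (8), the counts below are made up for the example and are not taken from the paper's experiments:

```python
def error_rate(misclassified, total):
    """ER = (O_m / O_t) * 100, expressed in percent."""
    return misclassified / total * 100

# 12 misclassified objects out of 150 (e.g., an Iris-sized data set):
print(error_rate(12, 150))  # -> 8.0
```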

As illustrated in Figure 12(a), in comparison with the other algorithms, the Par3PKM algorithm produces more accurate clustering results in most cases. The results indicate that the Par3PKM algorithm is valid and feasible.

6.4. Evaluation on Speedup. In order to evaluate the speedup of the Par3PKM algorithm, we keep seven data sets constant (1280 MB) and increase the number of nodes (ranging from 1 node to 8 nodes) on a Hadoop cluster, and then plot the results in Figure 13(a). Moreover, we utilize the Iris data sets (1280 MB) to further verify the speedup of Par3PKM in comparison with Par2PK-Means and ParCLARA, and the results are illustrated in Figure 13(b).

The speedup metric [52, 53] is defined as

Speedup = T_s / T_p, (9)

where T_s represents the execution time of an algorithm on one node (i.e., the sequential execution time) for clustering

Figure 13: Speedup: (a) speedup comparison of Par3PKM on different data sets and (b) speedup comparison of Par3PKM, Par2PK-Means, and ParCLARA.

objects using the given data sets, and T_p denotes the execution time of an algorithm for solving the same problem using the same data sets on a Hadoop cluster with p nodes (i.e., the parallel execution time), respectively.
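In the same spirit, (9) is a simple ratio of measured run times. The times in this sketch are hypothetical, chosen only to show how the per-node-count speedup curve is derived:

```python
def speedup(t_sequential, t_parallel):
    """Speedup = T_s / T_p for one algorithm on the same data set."""
    return t_sequential / t_parallel

# hypothetical execution times (s) on 1, 2, 4, and 8 nodes
times = {1: 4000.0, 2: 2100.0, 4: 1150.0, 8: 640.0}
print([round(speedup(times[1], t), 2) for t in times.values()])
# -> [1.0, 1.9, 3.48, 6.25]  (sub-linear, as the communication cost grows)
```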

As depicted in Figure 13(a), the speedup of the Par3PKM algorithm increases relatively linearly with an increasing number of nodes. It is known that linear speedup is difficult to achieve because of the communication cost and the skew of the slaves. Furthermore, Figure 13(b) shows that Par3PKM has better speedup than Par2PK-Means and ParCLARA. The results demonstrate that the parallel algorithm Par3PKM has a very good speedup performance, which is almost the same for data sets with very different sizes.

6.5. Evaluation on Scale-Up. To evaluate how well the Par3PKM algorithm processes larger data sets when more nodes are available, we perform scale-up experiments where we increase the size of the data sets (varying from 160 MB to 1280 MB) in direct proportion to the number of nodes (ranging from 1 node to 8 nodes) and then plot the results in Figure 14(a). Furthermore, with the Iris data sets (varying from 160 MB to 1280 MB), the scale-up comparison of Par3PKM, Par2PK-Means, and ParCLARA is depicted in Figure 14(b).

The scale-up metric [52, 53] is given by

Scale-up = T_s / T_p, (10)

where T_s is the execution time of an algorithm for processing the given data sets on one node and T_p is the execution time of an algorithm for handling p-times larger data sets on p-times larger nodes.

As illustrated in Figure 14(a), the scale-up values of Par3PKM are in the vicinity of 1, or even less, with the proportional growth of both the number of nodes and the size of the data sets. Moreover, Figure 14(b) shows that Par3PKM has better scale-up than Par2PK-Means and ParCLARA. The results indicate that the Par3PKM algorithm has excellent scale-up and adaptability to large-scale data sets on a Hadoop cluster with MapReduce.

6.6. Evaluation on Reliability. To evaluate the reliability of the Par3PKM algorithm, we shut down several nodes (ranging from 1 node to 7 nodes) to demonstrate whether Par3PKM can normally execute and achieve the same clustering results from the Iris data sets with 1280 MB, and then plot the results in Figure 12(b).

As illustrated in Figure 12(b), although the execution time of the Par3PKM algorithm increases gradually with the growth of the number of faulty nodes, the Par3PKM algorithm still normally executes and produces the same results. The results show that the Par3PKM algorithm has good reliability in a "Big Data" environment, due to the high fault tolerance of the MapReduce framework on a Hadoop platform: when a node cannot execute tasks on a Hadoop cluster, the JobTracker will automatically reassign the tasks of the faulty node(s) to other spare nodes. Conversely, the serial K-Means algorithm on a single machine cannot normally execute the tasks when the machine is faulty, and the entire computational task will fail.

In summary, extensive experiments are conducted on real and synthetic data, and the performance results demonstrate that the proposed Par3PKM algorithm is much more efficient and accurate, with better speedup, scale-up, and reliability.

Discrete Dynamics in Nature and Society

[Figure 14 appears here: scale-up versus number of nodes (1–8); panel (a) compares Par3PKM on the Taxi Trajectory, Iris, Haberman's Survival, Ecoli, Hayes-Roth, Lenses, and Wine data sets; panel (b) compares Par3PKM, Par2PK-Means, and ParCLARA.]

Figure 14: Scale-up. (a) Scale-up comparison of Par3PKM on different data sets and (b) scale-up comparison of Par3PKM, Par2PK-Means, and ParCLARA.

7. Conclusions

In this paper, we have proposed an efficient MapReduce-based parallel clustering algorithm, named Par3PKM, to solve the traffic subarea division problem with large-scale taxi trajectories. In Par3PKM, the distance metric and initialization strategy of K-Means are optimized in order to enhance the accuracy of clustering. Then, to improve the efficiency and scalability of Par3PKM, the optimized K-Means algorithm is implemented in a MapReduce framework on Hadoop. The optimization and parallelism of Par3PKM reduce memory consumption and the computational cost of big calculations, thereby significantly improving the accuracy, efficiency, scalability, and reliability of traffic subarea division. Our performance evaluation indicates that the proposed algorithm can efficiently cluster a large number of GPS trajectories of taxicabs and, in particular, achieves more accurate results than K-Means, Par2PK-Means, and ParCLARA, with favorably superior performance. Furthermore, based on Par3PKM, we have presented a distributed traffic subarea division method, named DTSAD, which is performed on a Hadoop distributed computing platform with the MapReduce parallel processing paradigm. In DTSAD, the boundary identifying method can effectively connect the borders of the clustering results. Most importantly, we have divided the traffic subareas of Beijing using big real-world taxi trajectory data sets through the presented method, and our case study demonstrates that our approach can accurately and efficiently divide traffic subareas.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Authors' Contribution

Dawen Xia and Binfeng Wang contributed equally to this work.

Acknowledgments

The authors would like to thank the academic editor and the anonymous reviewers for their valuable comments and suggestions. This work was partially supported by the National Natural Science Foundation of China (Grant no. 61402380), the Scientific Project of the State Ethnic Affairs Commission of the People's Republic of China (Grant no. 14GZZ012), the Science and Technology Foundation of Guizhou (Grant no. LH20147386), and the Fundamental Research Funds for the Central Universities (Grant nos. XDJK2015B030 and XDJK2015D029).

References

[1] Y. Qi and S. Ishak, "Stochastic approach for short-term freeway traffic prediction during peak periods," IEEE Transactions on Intelligent Transportation Systems, vol. 14, no. 2, pp. 660–672, 2013.

[2] A. de Palma and R. Lindsey, "Traffic congestion pricing methodologies and technologies," Transportation Research Part C: Emerging Technologies, vol. 19, no. 6, pp. 1377–1399, 2011.

[3] V. Marx, "The big challenges of big data," Nature, vol. 498, no. 7453, pp. 255–260, 2013.

[4] D. Agrawal, P. Bernstein, E. Bertino et al., "Challenges and opportunities with big data: a community white paper developed by leading researchers across the United States," White Paper, 2012.

[5] "Special online collection: dealing with data," Science, vol. 331, no. 6018, pp. 639–806, 2011.

[6] "Big data: science in the petabyte era," Nature, vol. 455, no. 7209, pp. 1–136, 2008.

[7] R. R. Weiss and L. Zgorski, Obama Administration Unveils 'Big Data' Initiative: Announces $200 Million in New R&D Investments, Office of Science and Technology Policy, Executive Office of the President, 2012.

[8] J. Manyika, M. Chui, B. Brown et al., "Big data: the next frontier for innovation, competition, and productivity," Tech. Rep., McKinsey Global Institute, 2011.

[9] Y. Genovese and S. Prentice, "Pattern-based strategy: getting value from big data," Gartner Special Report, 2011.

[10] N. J. Yuan, Y. Zheng, L. Zhang, and X. Xie, "T-finder: a recommender system for finding passengers and vacant taxis," IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 10, pp. 2390–2403, 2013.

[11] J. Yuan, Y. Zheng, C. Zhang et al., "T-drive: driving directions based on taxi trajectories," in Proceedings of the 18th International Conference on Advances in Geographic Information Systems (ACM SIGSPATIAL GIS '10), pp. 99–108, ACM, San Jose, Calif, USA, November 2010.

[12] J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, Waltham, Mass, USA, 3rd edition, 2011.

[13] D. Pelleg and A. Moore, "X-means: extending K-means with efficient estimation of the number of clusters," in Proceedings of the 17th International Conference on Machine Learning (ICML '00), pp. 727–734, Stanford, Calif, USA, June 2000.

[14] S. Kantabutra and A. L. Couch, "Parallel K-means clustering algorithm on NOWs," NECTEC Technical Journal, vol. 1, no. 6, pp. 243–247, 2000.

[15] Y. Zhang, Z. Xiong, J. Mao, and L. Ou, "The study of parallel K-means algorithm," in Proceedings of the 6th World Congress on Intelligent Control and Automation (WCICA '06), pp. 5868–5871, IEEE, Dalian, China, June 2006.

[16] P. Kraj, A. Sharma, N. Garge, R. Podolsky, and R. A. McIndoe, "ParaKMeans: implementation of a parallelized K-means algorithm suitable for general laboratory use," BMC Bioinformatics, vol. 9, article 200, 13 pages, 2008.

[17] M. K. Pakhira, "Clustering large databases in distributed environment," in Proceedings of the IEEE International Advance Computing Conference (IACC '09), pp. 351–358, Patiala, India, March 2009.

[18] K. J. Kohlhoff, V. S. Pande, and R. B. Altman, "K-means for parallel architectures using all-prefix-sum sorting and updating steps," IEEE Transactions on Parallel and Distributed Systems, vol. 24, no. 8, pp. 1602–1612, 2013.

[19] C.-T. Chu, S. K. Kim, Y.-A. Lin et al., "Map-reduce for machine learning on multicore," in Proceedings of the 20th Annual Conference on Neural Information Processing Systems (NIPS '06), pp. 281–288, Vancouver, Canada, 2006.

[20] W. Zhao, H. Ma, and Q. He, "Parallel K-means clustering based on MapReduce," in Proceedings of the 1st International Conference on Cloud Computing (CloudCom '09), pp. 674–679, Beijing, China, December 2009.

[21] P. Zhou, J. Lei, and W. Ye, "Large-scale data sets clustering based on MapReduce and Hadoop," Journal of Computational Information Systems, vol. 7, no. 16, pp. 5956–5963, 2011.

[22] C. D. Nguyen, D. T. Nguyen, and V.-H. Pham, "Parallel two-phase K-means," in Proceedings of the 13th International Conference on Computational Science and Its Applications (ICCSA '13), pp. 24–27, Ho Chi Minh City, Vietnam, June 2013.

[23] R. J. Walinchus, "Real-time network decomposition and subnetwork interfacing," Tech. Rep. HS-011 999, Highway Research Record, 1971.

[24] S. C. Wong, W. T. Wong, C. M. Leung, and C. O. Tong, "Group-based optimization of a time-dependent TRANSYT traffic model for area traffic control," Transportation Research Part B: Methodological, vol. 36, no. 4, pp. 291–312, 2002.

[25] D. I. Robertson and R. D. Bretherton, "Optimizing networks of traffic signals in real time: the SCOOT method," IEEE Transactions on Vehicular Technology, vol. 40, no. 1, pp. 11–15, 1991.

[26] Y.-Y. Ma and X.-G. Yang, "Traffic sub-area division expert system for urban traffic control," in Proceedings of the International Conference on Intelligent Computation Technology and Automation (ICICTA '08), pp. 589–593, Hunan, China, October 2008.

[27] K. Lu, J.-M. Xu, S.-J. Zheng, and S.-M. Wang, "Research on fast dynamic division method of coordinated control subarea," Acta Automatica Sinica, vol. 38, no. 2, pp. 279–287, 2012.

[28] K. Lu, J.-M. Xu, and S.-J. Zheng, "Correlation degree analysis of neighboring intersections and its application," Journal of South China University of Technology, vol. 37, no. 11, pp. 37–42, 2009.

[29] H. Guo, J. Cheng, Q. Peng, C. Zhu, and Y. Mu, "Dynamic division of traffic control sub-area methods based on the similarity of adjacent intersections," in Proceedings of the IEEE 17th International Conference on Intelligent Transportation Systems (ITSC '14), pp. 2208–2213, Qingdao, China, October 2014.

[30] C. Li, Y. Xie, H. Zhang, and X.-L. Yan, "Dynamic division about traffic control sub-area based on back propagation neural network," in Proceedings of the 2nd International Conference on Intelligent Human-Machine Systems and Cybernetics (IHMSC '10), pp. 22–25, Nanjing, China, August 2010.

[31] Z. Zhou, S. Lin, and Y. Xi, "A fast network partition method for large-scale urban traffic networks," Journal of Control Theory and Applications, vol. 11, no. 3, pp. 359–366, 2013.

[32] J. MacQueen, "Some methods for classification and analysis of multivariate observations," in Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, pp. 281–297, University of California Press, Berkeley, Calif, USA, 1967.

[33] A. K. Jain, "Data clustering: 50 years beyond K-means," Pattern Recognition Letters, vol. 31, no. 8, pp. 651–666, 2010.

[34] X. Wu, V. Kumar, J. R. Quinlan et al., "Top 10 algorithms in data mining," Knowledge and Information Systems, vol. 14, no. 1, pp. 1–37, 2008.

[35] W. Zou, Y. Zhu, H. Chen, and X. Sui, "A clustering approach using cooperative artificial bee colony algorithm," Discrete Dynamics in Nature and Society, vol. 2010, Article ID 459796, 16 pages, 2010.

[36] D. T. Pham, S. S. Dimov, and C. D. Nguyen, "An incremental K-means algorithm," Proceedings of the Institution of Mechanical Engineers, Part C: Journal of Mechanical Engineering Science, vol. 218, no. 7, pp. 783–794, 2004.

[37] D. T. Pham, S. S. Dimov, and C. D. Nguyen, "A two-phase K-means algorithm for large datasets," Journal of Mechanical Engineering Science, vol. 218, no. 10, pp. 1269–1273, 2004.

[38] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, "The Hadoop distributed file system," in Proceedings of the 26th Symposium on Mass Storage Systems and Technologies (MSST '10), pp. 1–10, IEEE, Incline Village, Nev, USA, May 2010.

[39] A. Ene, S. Im, and B. Moseley, "Fast clustering using MapReduce," in Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '11), pp. 681–689, San Diego, Calif, USA, August 2011.

[40] T. White, Hadoop: The Definitive Guide, O'Reilly Media, Sebastopol, Calif, USA, 3rd edition, 2012.

[41] D. Xia, Z. Rong, Y. Zhou, Y. Li, Y. Shen, and Z. Zhang, "A novel parallel algorithm for frequent itemsets mining in massive small files datasets," ICIC Express Letters, Part B: Applications, vol. 5, no. 2, pp. 459–466, 2014.

[42] D. Xia, Z. Rong, Y. Zhou, B. Wang, Y. Li, and Z. Zhang, "Discovery and analysis of usage data based on Hadoop for personalized information access," in Proceedings of the IEEE 16th International Conference on Computational Science and Engineering, Big Data Science and Engineering (CSE-BDSE '13), pp. 917–924, IEEE, Sydney, Australia, December 2013.

[43] D. Xia, B. Wang, Z. Rong, Y. Li, and Z. Zhang, "Effective methods and strategies for massive small files processing based on Hadoop," ICIC Express Letters, vol. 8, no. 7, pp. 1935–1941, 2014.

[44] S. Ghemawat, H. Gobioff, and S.-T. Leung, "The Google file system," in Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP '03), pp. 29–43, Bolton Landing, NY, USA, October 2003.

[45] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.

[46] P. Zikopoulos, C. Eaton, D. deRoos, T. Deutsch, and G. Lapis, Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data, McGraw-Hill, New York, NY, USA, 2011.

[47] W. K. D. Pun and A. B. M. S. Ali, "Unique distance measure approach for K-means (UDMA-Km) clustering algorithm," in Proceedings of the IEEE Region 10 Conference (TENCON '07), pp. 1–4, IEEE, Taipei, Taiwan, November 2007.

[48] A. M. Fahim, A. M. Salem, F. A. Torkey, and M. A. Ramadan, "An efficient enhanced K-means clustering algorithm," Journal of Zhejiang University SCIENCE A, vol. 7, no. 10, pp. 1626–1633, 2006.

[49] M. Zhu, Data Mining, University of Science and Technology of China Press, 2002.

[50] L. Kaufman and P. J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, John Wiley & Sons, 1990.

[51] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proceedings of the 26th Annual Conference on Neural Information Processing Systems (NIPS '12), pp. 1097–1105, Lake Tahoe, Nev, USA, December 2012.

[52] S. Englert, J. Gray, T. Kocher, and P. Shah, "A benchmark of NonStop SQL release 2 demonstrating near-linear speedup and scaleup on large databases," ACM SIGMETRICS Performance Evaluation Review, vol. 18, no. 1, pp. 245–246, 1990.

[53] X. Xu, J. Jäger, and H.-P. Kriegel, "A fast parallel clustering algorithm for large spatial databases," Data Mining and Knowledge Discovery, vol. 3, no. 3, pp. 263–290, 1999.



Input: key2: the index of the cluster; medi: the list of the local sums from different clusters.
Output: ⟨key3, value3⟩: key3: the index of the cluster; value3: the new cluster center.
(1) Construct a counter Num to record the total number of samples belonging to the same cluster
(2) Construct an array sum_v to record the sum of the values of different dimensions of the samples in the same cluster (i.e., the samples in the list medi)
(3) Construct the sample examples to extract the data objects from medi.next() and the dimensions to obtain the dimension of the original data object
(4) Num = 0
(5) while (medi.hasNext()) do
(6)   CurrentPoint = medi.next()
(7)   Num += num_s
(8)   for i = 0 to dimensions do
(9)     sum_v[i] += CurrentPoint.point[i]
(10)  end for
(11)  for i = 0 to dimensions do
(12)    mean[i] = sum_v[i] / Num
(13)    Obtain the new cluster center
(14)  end for
(15) end while
(16) index = key3
(17) Construct value3 as a string composed of the new cluster center
(18) return ⟨key3, value3⟩ pairs

Algorithm 3: Reduce(key2, medi).
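The computation in Algorithm 3 can be mirrored in a few lines of ordinary Python. This is an illustrative sketch only (the names reduce_center and partial_sums are ours, not the paper's): each pair plays the role of one entry of the list medi, carrying a partial vector sum and the number of samples it covers.

```python
def reduce_center(partial_sums):
    """Merge per-cluster partial results into the new cluster center.

    partial_sums: iterable of (sum_vector, num_samples) pairs, one per
    local (combiner-side) sum, mirroring the list medi in Algorithm 3."""
    sum_v = None
    num = 0                               # (4) Num = 0
    for vec, num_s in partial_sums:       # (5) while medi.hasNext()
        if sum_v is None:
            sum_v = [0.0] * len(vec)
        for i, v in enumerate(vec):
            sum_v[i] += v                 # (9) sum_v[i] += point[i]
        num += num_s                      # (7) Num += num_s
    return [s / num for s in sum_v]       # (12) mean[i] = sum_v[i] / Num

# Two partial sums for one cluster: (4+2)/3 and (6+0)/3 give the center.
print(reduce_center([([4.0, 6.0], 2), ([2.0, 0.0], 1)]))  # [2.0, 2.0]
```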

[Figure 8 appears here: clustered taxi trajectories plotted on the Beijing road network, with Area A highlighted.]

Figure 8: Clustering results of large-scale taxi trajectories.

5.2. Parallel Clustering. After data preprocessing with the DTSAD method, we extract the relevant attributes (e.g., longitude, latitude) of the GPS trajectory records where passengers pick up or drop off taxis from the aforementioned data sets. Then, on a Hadoop cluster with MapReduce, we cluster the extracted trajectory data sets through the Par3PKM algorithm (as depicted in Section 4) with K = 100, which is the number of desired clusters. Finally, based on the ArcGIS platform with the road network of Beijing, we plot the results in Figure 8.

As illustrated in Figure 8, large-scale taxi trajectories are clustered into different areas, and each area with a different color represents a cluster. Each cluster (e.g., Area A) has obvious characteristics of traffic condition; for instance, the flow of people and automobiles is high in these areas, in comparison with the real traffic map and traffic conditions of Beijing.

[Figure 9 appears here: three panels showing (a) the coordinate system divided into n equal parts of 360/n degrees, (b) the farthest border point P selected in one part, and (c) the border points connected.]

Figure 9: Process of the boundary identifying method. (a) Division of coordinate system, (b) selection of border points, and (c) connection of border points.

5.3. Boundary Identifying. On the ArcGIS platform, it is difficult to identify the borders of each cluster. However, we have to connect the borders of each cluster in order to accurately form traffic subareas, which we do via our boundary identifying method, described in Figure 9. As illustrated in Figure 9, the boundary identifying method consists of the following three steps.

Step 1. As shown in Figure 9(a), we build a coordinate system that is equally divided into n parts, taking (0, 0) as the origin of coordinates.

Step 2. We match each cluster center to the origin of coordinates and then map the other points of the same cluster into the coordinate system. Finally, the farthest point of each part is selected (e.g., P in Figure 9(b)).

Step 3. As depicted in Figure 9(c), we connect these selected points of each part and then obtain a subarea.
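The three steps can be sketched with angular sectors around the cluster center; this is a minimal illustration under our own naming (boundary, n sectors), not the authors' ArcGIS implementation.

```python
import math

def boundary(points, center, n=36):
    """Boundary identifying sketch: shift the cluster so its center is the
    origin (Step 2), split the plane into n equal angular parts (Step 1),
    keep the farthest point of each part, and return the points in angular
    order so that connecting consecutive ones traces the border (Step 3)."""
    cx, cy = center
    farthest = {}  # sector index -> (distance, original point)
    for x, y in points:
        dx, dy = x - cx, y - cy
        sector = int((math.atan2(dy, dx) % (2 * math.pi)) / (2 * math.pi / n))
        dist = math.hypot(dx, dy)
        if sector not in farthest or dist > farthest[sector][0]:
            farthest[sector] = (dist, (x, y))
    return [farthest[s][1] for s in sorted(farthest)]

# Toy cluster around center (0, 0) with four quadrant sectors:
print(boundary([(1, 0), (0, 1), (-1, 0), (0, -1), (0.2, 0.1)], (0, 0), n=4))
```

The interior point (0.2, 0.1) falls in the same sector as (1, 0) but is closer to the center, so only the four extreme points survive as border candidates.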

Using the presented boundary identifying method with the clustering results of Par3PKM, we plot the division results of traffic subareas in Figure 10. As described in Figure 10, Area A is a typical traffic subarea, shown in the lower right corner of the graph, which includes Tsinghua University, Peking University, Renmin University of China, Beihang University, and Zhongguancun. Area B is composed of the Beijing Workers' Gymnasium, Blue Zoo Beijing, Children Outdoor Paopao Paradise, and so forth. Area C consists of Beijing North Railway Station, Beijing Exhibition Center, Capital Indoor Stadium, and so forth. Area D contains the Olympic Sports Center Stadium, Beijing Olympic Park, National Indoor Stadium, Beijing National Aquatics Center, Beijing International Convention Center, and so forth.

5.4. Analysis of Results. According to the division results, we can observe that some areas with similar traffic conditions are divided into the same traffic subarea, such as Tsinghua University and Peking University, or the Olympic Sports Center Stadium and the National Indoor Stadium. In contrast, Blue Zoo Beijing and the Beijing Exhibition Center, as well as Beijing North Railway Station and the Beijing International Convention Center, are classified into different traffic subareas. This is because the different regions of one traffic subarea have great similarities and correlations in traffic conditions, business patterns, and other aspects.

Based on the above analysis, we conclude that the proposed Par3PKM algorithm can efficiently cluster big trajectory data on a Hadoop cluster using MapReduce. Moreover, our boundary identifying method can accurately connect the borders of the clustering results for each cluster. In particular, the division results are consistent with the real traffic conditions of the corresponding areas in Beijing. Overall, the results demonstrate that Par3PKM, combined with DTSAD, is a promising alternative for traffic subarea division with large-scale taxi trajectories and can thus reduce the complexity of traffic planning, management, and analysis. More importantly, it can provide helpful decision-making support for building ITSs.

6. Evaluation and Discussion

In this section, we evaluate the accuracy, efficiency, speedup, scale-up, and reliability of the Par3PKM algorithm via extensive experiments on real and synthetic data and then discuss the performance results.

6.1. Evaluation Setup. The experimental platform is based on a Hadoop cluster composed of one master machine and eight slave machines, each with an Intel Xeon E7-4820 2.00 GHz CPU (4-core) and 8.00 GB RAM. All the experiments are performed on Ubuntu 12.04 OS with Hadoop 1.0.4 and JDK 1.6.0.


[Figure 10 appears here: the divided map of Beijing with Areas A, B, C, and D marked.]

Figure 10: Division results of traffic subareas.

Table 1: Data sets of the experimental evaluations.

Name                  Number of instances   Number of attributes
Iris                  150                   4
Haberman's Survival   306                   3
Ecoli                 336                   8
Hayes-Roth            160                   5
Lenses                24                    4
Wine                  178                   13

In addition to a real taxi trajectory data set (as described in Section 5.1), we use six synthetic data sets (as shown in Table 1) selected from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets.html) to evaluate the performance of the Par3PKM algorithm in comparison with K-Means, Par2PK-Means, and ParCLARA, which is a MapReduce implementation of the best-known K-medoid algorithm, CLARA (Clustering Large Applications) [50]. Meanwhile, each data set is processed into 80 MB, 160 MB, 320 MB, 640 MB, 1280 MB, and 2560 MB versions so as to further verify the efficiency of the proposed algorithm. Also, we process the seven data sets into 160 MB, 320 MB, 480 MB, 640 MB, 800 MB, 960 MB, 1120 MB, and 1280 MB versions for validating the scale-up of our algorithm.

6.2. Evaluation on Efficiency. We perform efficiency experiments in which we execute Par3PKM, Par2PK-Means, and ParCLARA in the parallel environment with eight nodes, and K-Means in the single-machine environment, using the seven data sets with different sizes (varying from 80 MB to 2560 MB), in order to demonstrate whether the Par3PKM algorithm can process larger data sets with higher efficiency. The experimental results are shown in Table 2 and Figure 11, respectively.

As depicted in Figure 11, the K-Means algorithm cannot process data sets over 1280 MB in the single-machine environment on account of memory overflow, and thus the graph does not present the corresponding execution time of K-Means in this experiment. However, the Par3PKM algorithm can effectively handle data sets of more than 1280 MB, and even larger data, in the parallel environment (i.e., in a "Big Data" environment). In particular, the Par3PKM algorithm has higher efficiency than the ParCLARA algorithm and the Par2PK-Means algorithm, owing to the improvement and parallelism of the K-Means algorithm, such as the addition of the Combiner function in the Combine phase. To reduce the computational complexity of a MapReduce job and save the limited bandwidth available on a Hadoop cluster, the Combiner function of the Par3PKM algorithm is employed to cut down the amount of data shuffled between the Map tasks and the Reduce tasks; it runs on the Map output, and its output forms the input to the Reduce function.
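The effect of the Combiner can be seen in a toy, pure-Python simulation of one K-Means iteration (our own sketch; the paper's implementation runs on Hadoop): without a combiner, one record per sample is shuffled; with it, only one partially summed record per cluster leaves each map task.

```python
from collections import defaultdict

def map_phase(points, centers):
    """Map: emit (index of nearest center, (point, 1)) for every sample."""
    out = []
    for p in points:
        k = min(range(len(centers)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])))
        out.append((k, (p, 1)))
    return out

def combine_phase(mapped):
    """Combine: locally pre-sum points per cluster, so a single
    (sum_vector, count) record per cluster is shuffled to the reducer."""
    acc = defaultdict(lambda: ([0.0, 0.0], 0))
    for k, ((x, y), n) in mapped:
        (sx, sy), c = acc[k]
        acc[k] = ([sx + x, sy + y], c + n)
    return sorted(acc.items())

points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.1, 4.9)]
mapped = map_phase(points, centers=[(0.0, 0.0), (5.0, 5.0)])
combined = combine_phase(mapped)
print(len(mapped), "records without combiner,", len(combined), "with")
```

On real trajectory data the ratio is far more dramatic: millions of samples collapse to K records per map task.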

At the same time, we find that the execution time of the K-Means algorithm is shorter than that of the Par3PKM algorithm when clustering small-scale data sets. The reason is that the communication and interaction of each node consume a certain amount of time in the parallel environment (e.g., the start of Job and Task tasks, and the communication between the NameNode and the DataNodes), thereby making the execution time much longer than the actual computation time of the Par3PKM algorithm. More importantly, we also observe that the efficiency of the Par3PKM algorithm improves multiply, and its superiority becomes more marked, as the size of the data sets gradually increases.


Table 2: Execution time comparison on seven data sets.

Data sets            Size (MB)   Execution time (s)
                                 K-Means   ParCLARA   Par2PK-Means   Par3PKM
Taxi Trajectory      80          75        662        439            312
                     160         103       756        486            371
                     320         195       893        579            406
                     640         1125      1173       838            614
                     1280        2373      1679       1348           1026
                     2560        —         2721       2312           1879
Iris                 80          32        612        301            213
                     160         54        693        380            279
                     320         116       812        440            306
                     640         685       1045       630            403
                     1280        1800      1248       1005           768
                     2560        —         2463       2013           1420
Haberman's Survival  80          56        576        311            296
                     160         60        675        400            324
                     320         130       823        470            378
                     640         720       987        670            426
                     1280        2010      1321       1200           873
                     2560        —         2449       2200           1719
Ecoli                80          38        628        330            283
                     160         66        712        400            324
                     320         130       835        460            375
                     640         756       1104       700            482
                     1280        1912      1636       1234           763
                     2560        —         2479       2312           1416
Hayes-Roth           80          41        568        310            278
                     160         57        643        395            347
                     320         125       726        460            387
                     640         715       973        660            438
                     1280        1980      1479       1211           736
                     2560        —         2423       2120           1567
Lenses               80          32        624        309            297
                     160         59        701        389            327
                     320         130       924        452            376
                     640         700       1072       545            432
                     1280        1895      1378       1085           814
                     2560        —         2379       2089           1473
Wine                 80          50        635        350            317
                     160         78        705        420            356
                     320         130       835        470            402
                     640         730       1006       610            426
                     1280        2100      1346       1245           843
                     2560        —         2463       2240           1645

6.3. Evaluation on Accuracy. To evaluate the accuracy, we cluster the different data sets via Par3PKM, Par2PK-Means, and ParCLARA on a Hadoop cluster with eight nodes, and via K-Means in the single-machine environment, respectively. We then plot the results in Figure 12(a).

The quality of the clustering is evaluated via the following error rate (ER) [51] equation:

ER = (O_m / O_t) × 100%, (8)


[Figure 11 appears here: seven panels plotting execution time (s) against size of data sets (MB, 80–2560) for K-Means, ParCLARA, Par2PK-Means, and Par3PKM.]

Figure 11: Efficiency comparison on different data sets. (a) Taxi Trajectory, (b) Iris, (c) Haberman's Survival, (d) Ecoli, (e) Hayes-Roth, (f) Lenses, and (g) Wine.

[Figure 12 appears here: panel (a) plots error rate (%) of K-Means, ParCLARA, Par2PK-Means, and Par3PKM on the Iris, Haberman's Survival, Ecoli, Lenses, Hayes-Roth, and Wine data sets; panel (b) plots execution time (s) against the number of faulty nodes (0–7).]

Figure 12: Accuracy and reliability. (a) Accuracy comparison of Par3PKM, Par2PK-Means, and ParCLARA on different data sets and (b) reliability of Par3PKM.

where O_m is the number of misclassified objects and O_t is the total number of objects. The lower the ER, the better the clustering.
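Assuming the cluster labels have already been matched to the ground-truth classes (the usual preliminary for this metric), equation (8) is a one-liner; error_rate is our illustrative name, not from the paper.

```python
def error_rate(predicted, truth):
    """ER (equation (8)): misclassified objects O_m over total objects O_t,
    expressed as a percentage. Lower is better."""
    o_m = sum(1 for p, t in zip(predicted, truth) if p != t)
    return o_m / len(truth) * 100

# 2 of 8 objects land in the wrong cluster -> ER = 25.0
print(error_rate([0, 0, 1, 1, 2, 2, 0, 1],
                 [0, 0, 1, 1, 2, 2, 1, 2]))  # 25.0
```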

As illustrated in Figure 12(a), in comparison with the other algorithms, the Par3PKM algorithm produces more accurate clustering results in most cases. The results indicate that the Par3PKM algorithm is valid and feasible.

6.4. Evaluation on Speedup. In order to evaluate the speedup of the Par3PKM algorithm, we keep the seven data sets constant (1280 MB) and increase the number of nodes (ranging from 1 node to 8 nodes) on a Hadoop cluster, and then plot the results in Figure 13(a). Moreover, we utilize the Iris data sets (1280 MB) to further verify the speedup of Par3PKM in comparison with Par2PK-Means and ParCLARA, and the results are illustrated in Figure 13(b).

The speedup metric [52, 53] is defined as

Speedup = T_s / T_p, (9)

where T_s represents the execution time of an algorithm on one node (i.e., the sequential execution time) for clustering objects using the given data sets, and T_p denotes the execution time of the algorithm for solving the same problem using the same data sets on a Hadoop cluster with p nodes (i.e., the parallel execution time).

[Figure 13 appears here: two panels plotting speedup against number of nodes (1–8) with a linear reference line; panel (a) compares Par3PKM on the Taxi Trajectory, Iris, Haberman's Survival, Ecoli, Hayes-Roth, Lenses, and Wine data sets; panel (b) compares Par3PKM, Par2PK-Means, and ParCLARA.]

Figure 13: Speedup. (a) Speedup comparison of Par3PKM on different data sets and (b) speedup comparison of Par3PKM, Par2PK-Means, and ParCLARA.

As depicted in Figure 13(a), the speedup of the Par3PKM algorithm increases relatively linearly with an increasing number of nodes. It is known that linear speedup is difficult to achieve because of the communication cost and the skew of the slaves. Furthermore, Figure 13(b) shows that Par3PKM has better speedup than Par2PK-Means and ParCLARA. The results demonstrate that the parallel Par3PKM algorithm has very good speedup performance, which is almost the same for data sets with very different sizes.
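Equation (9), and the parallel efficiency it implies (speedup divided by node count), can be checked with a short sketch; the timings below are hypothetical, not the paper's measurements.

```python
def speedup(t_s, t_p):
    """Speedup (equation (9)): sequential time T_s over parallel time T_p
    for the same workload on p nodes; linear speedup means speedup == p."""
    return t_s / t_p

def efficiency(t_s, t_p, p):
    """Parallel efficiency: speedup divided by node count (1.0 is ideal);
    communication cost and slave skew usually pull it below 1."""
    return speedup(t_s, t_p) / p

# Hypothetical timings (s) for a fixed 1280 MB workload.
times = {1: 4400, 2: 2340, 4: 1260, 8: 700}
for p in sorted(times):
    print(p, round(speedup(times[1], times[p]), 2),
          round(efficiency(times[1], times[p], p), 2))
```

The gradual drop of efficiency with p is the sublinear bend visible against the linear reference line in Figure 13.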


Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Authorsrsquo Contribution

Dawen Xia and Binfeng Wang contributed equally to thiswork

Acknowledgments

The authors would like to thank the academic editor and theanonymous reviewers for their valuable comments and sug-gestions This work was partially supported by the NationalNatural Science Foundation of China (Grant no 61402380)the Scientific Project of State Ethnic Affairs Commissionof the Peoplersquos Republic of China (Grant no 14GZZ012)the Science and Technology Foundation of Guizhou (Grantno LH20147386) and the Fundamental Research Funds forthe Central Universities (Grants nos XDJK2015B030 andXDJK2015D029)

References

[1] Y. Qi and S. Ishak, "Stochastic approach for short-term freeway traffic prediction during peak periods," IEEE Transactions on Intelligent Transportation Systems, vol. 14, no. 2, pp. 660–672, 2013.

[2] A. de Palma and R. Lindsey, "Traffic congestion pricing methodologies and technologies," Transportation Research Part C: Emerging Technologies, vol. 19, no. 6, pp. 1377–1399, 2011.

[3] V. Marx, "The big challenges of big data," Nature, vol. 498, no. 7453, pp. 255–260, 2013.

[4] D. Agrawal, P. Bernstein, E. Bertino et al., "Challenges and opportunities with big data: a community white paper developed by leading researchers across the United States," White Paper, 2012.

[5] "Special online collection: dealing with data," Science, vol. 331, no. 6018, pp. 639–806, 2011.

[6] "Big data: science in the petabyte era," Nature, vol. 455, no. 7209, pp. 1–136, 2008.

[7] R. R. Weiss and L. Zgorski, Obama Administration Unveils 'Big Data' Initiative: Announces $200 Million in New R&D Investments, Office of Science and Technology Policy, Executive Office of the President, 2012.

[8] J. Manyika, M. Chui, B. Brown et al., "Big data: the next frontier for innovation, competition, and productivity," Tech. Rep., McKinsey Global Institute, 2011.

[9] Y. Genovese and S. Prentice, "Pattern-based strategy: getting value from big data," Gartner Special Report, 2011.

[10] N. J. Yuan, Y. Zheng, L. Zhang, and X. Xie, "T-finder: a recommender system for finding passengers and vacant taxis," IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 10, pp. 2390–2403, 2013.

[11] J. Yuan, Y. Zheng, C. Zhang et al., "T-drive: driving directions based on taxi trajectories," in Proceedings of the 18th International Conference on Advances in Geographic Information Systems (ACM SIGSPATIAL GIS '10), pp. 99–108, ACM, San Jose, Calif, USA, November 2010.

[12] J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, Waltham, Mass, USA, 3rd edition, 2011.

[13] D. Pelleg and A. Moore, "X-means: extending K-means with efficient estimation of the number of clusters," in Proceedings of the 17th International Conference on Machine Learning (ICML '00), pp. 727–734, Stanford, Calif, USA, June 2000.

[14] S. Kantabutra and A. L. Couch, "Parallel K-means clustering algorithm on NOWs," NECTEC Technical Journal, vol. 1, no. 6, pp. 243–247, 2000.

[15] Y. Zhang, Z. Xiong, J. Mao, and L. Ou, "The study of parallel K-means algorithm," in Proceedings of the 6th World Congress on Intelligent Control and Automation (WCICA '06), pp. 5868–5871, IEEE, Dalian, China, June 2006.

[16] P. Kraj, A. Sharma, N. Garge, R. Podolsky, and R. A. McIndoe, "ParaKMeans: implementation of a parallelized K-means algorithm suitable for general laboratory use," BMC Bioinformatics, vol. 9, article 200, 13 pages, 2008.

[17] M. K. Pakhira, "Clustering large databases in distributed environment," in Proceedings of the IEEE International Advance Computing Conference (IACC '09), pp. 351–358, Patiala, India, March 2009.

[18] K. J. Kohlhoff, V. S. Pande, and R. B. Altman, "K-means for parallel architectures using all-prefix-sum sorting and updating steps," IEEE Transactions on Parallel and Distributed Systems, vol. 24, no. 8, pp. 1602–1612, 2013.

[19] C.-T. Chu, S. K. Kim, Y.-A. Lin et al., "Map-reduce for machine learning on multicore," in Proceedings of the 20th Annual Conference on Neural Information Processing Systems (NIPS '06), pp. 281–288, Vancouver, Canada, 2006.

[20] W. Zhao, H. Ma, and Q. He, "Parallel K-means clustering based on MapReduce," in Proceedings of the 1st International Conference on Cloud Computing (CloudCom '09), pp. 674–679, Beijing, China, December 2009.

[21] P. Zhou, J. Lei, and W. Ye, "Large-scale data sets clustering based on MapReduce and Hadoop," Journal of Computational Information Systems, vol. 7, no. 16, pp. 5956–5963, 2011.

[22] C. D. Nguyen, D. T. Nguyen, and V.-H. Pham, "Parallel two-phase K-means," in Proceedings of the 13th International Conference on Computational Science and Its Applications (ICCSA '13), pp. 24–27, Ho Chi Minh City, Vietnam, June 2013.

[23] R. J. Walinchus, "Real-time network decomposition and sub-network interfacing," Tech. Rep. HS-011 999, Highway Research Record, 1971.

[24] S. C. Wong, W. T. Wong, C. M. Leung, and C. O. Tong, "Group-based optimization of a time-dependent TRANSYT traffic model for area traffic control," Transportation Research Part B: Methodological, vol. 36, no. 4, pp. 291–312, 2002.

[25] D. I. Robertson and R. D. Bretherton, "Optimizing networks of traffic signals in real time: the SCOOT method," IEEE Transactions on Vehicular Technology, vol. 40, no. 1, pp. 11–15, 1991.

[26] Y.-Y. Ma and X.-G. Yang, "Traffic sub-area division expert system for urban traffic control," in Proceedings of the International Conference on Intelligent Computation Technology and Automation (ICICTA '08), pp. 589–593, Hunan, China, October 2008.

[27] K. Lu, J.-M. Xu, S.-J. Zheng, and S.-M. Wang, "Research on fast dynamic division method of coordinated control subarea," Acta Automatica Sinica, vol. 38, no. 2, pp. 279–287, 2012.

[28] K. Lu, J.-M. Xu, and S.-J. Zheng, "Correlation degree analysis of neighboring intersections and its application," Journal of South China University of Technology, vol. 37, no. 11, pp. 37–42, 2009.

[29] H. Guo, J. Cheng, Q. Peng, C. Zhu, and Y. Mu, "Dynamic division of traffic control sub-area methods based on the similarity of adjacent intersections," in Proceedings of the IEEE 17th International Conference on Intelligent Transportation Systems (ITSC '14), pp. 2208–2213, Qingdao, China, October 2014.

[30] C. Li, Y. Xie, H. Zhang, and X.-L. Yan, "Dynamic division about traffic control sub-area based on back propagation neural network," in Proceedings of the 2nd International Conference on Intelligent Human-Machine Systems and Cybernetics (IHMSC '10), pp. 22–25, Nanjing, China, August 2010.

[31] Z. Zhou, S. Lin, and Y. Xi, "A fast network partition method for large-scale urban traffic networks," Journal of Control Theory and Applications, vol. 11, no. 3, pp. 359–366, 2013.

[32] J. MacQueen, "Some methods for classification and analysis of multivariate observations," in Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, pp. 281–297, University of California Press, Berkeley, Calif, USA, 1967.

[33] A. K. Jain, "Data clustering: 50 years beyond K-means," Pattern Recognition Letters, vol. 31, no. 8, pp. 651–666, 2010.

[34] X. Wu, V. Kumar, J. R. Quinlan et al., "Top 10 algorithms in data mining," Knowledge and Information Systems, vol. 14, no. 1, pp. 1–37, 2008.

[35] W. Zou, Y. Zhu, H. Chen, and X. Sui, "A clustering approach using cooperative artificial bee colony algorithm," Discrete Dynamics in Nature and Society, vol. 2010, Article ID 459796, 16 pages, 2010.

[36] D. T. Pham, S. S. Dimov, and C. D. Nguyen, "An incremental K-means algorithm," Proceedings of the Institution of Mechanical Engineers, Part C: Journal of Mechanical Engineering Science, vol. 218, no. 7, pp. 783–794, 2004.

[37] D. T. Pham, S. S. Dimov, and C. D. Nguyen, "A two-phase K-means algorithm for large datasets," Journal of Mechanical Engineering Science, vol. 218, no. 10, pp. 1269–1273, 2004.

[38] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, "The Hadoop distributed file system," in Proceedings of the 26th Symposium on Mass Storage Systems and Technologies (MSST '10), pp. 1–10, IEEE, Incline Village, Nev, USA, May 2010.

[39] A. Ene, S. Im, and B. Moseley, "Fast clustering using MapReduce," in Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '11), pp. 681–689, San Diego, Calif, USA, August 2011.

[40] T. White, Hadoop: The Definitive Guide, O'Reilly Media, Sebastopol, Calif, USA, 3rd edition, 2012.

[41] D. Xia, Z. Rong, Y. Zhou, Y. Li, Y. Shen, and Z. Zhang, "A novel parallel algorithm for frequent itemsets mining in massive small files datasets," ICIC Express Letters, Part B: Applications, vol. 5, no. 2, pp. 459–466, 2014.

[42] D. Xia, Z. Rong, Y. Zhou, B. Wang, Y. Li, and Z. Zhang, "Discovery and analysis of usage data based on Hadoop for personalized information access," in Proceedings of the IEEE 16th International Conference on Computational Science and Engineering—Big Data Science and Engineering (CSE-BDSE '13), pp. 917–924, IEEE, Sydney, Australia, December 2013.

[43] D. Xia, B. Wang, Z. Rong, Y. Li, and Z. Zhang, "Effective methods and strategies for massive small files processing based on Hadoop," ICIC Express Letters, vol. 8, no. 7, pp. 1935–1941, 2014.

[44] S. Ghemawat, H. Gobioff, and S.-T. Leung, "The Google file system," in Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP '03), pp. 29–43, Bolton Landing, NY, USA, October 2003.

[45] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.

[46] P. Zikopoulos, C. Eaton, D. deRoos, T. Deutsch, and G. Lapis, Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data, McGraw-Hill, New York, NY, USA, 2011.

[47] W. K. D. Pun and A. B. M. S. Ali, "Unique distance measure approach for K-means (UDMA-Km) clustering algorithm," in Proceedings of the IEEE Region 10 Conference (TENCON '07), pp. 1–4, IEEE, Taipei, Taiwan, November 2007.

[48] A. M. Fahim, A. M. Salem, F. A. Torkey, and M. A. Ramadan, "An efficient enhanced K-means clustering algorithm," Journal of Zhejiang University SCIENCE A, vol. 7, no. 10, pp. 1626–1633, 2006.

[49] M. Zhu, Data Mining, University of Science and Technology of China Press, 2002.

[50] L. Kaufman and P. J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, John Wiley & Sons, 1990.

[51] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proceedings of the 26th Annual Conference on Neural Information Processing Systems (NIPS '12), pp. 1097–1105, Lake Tahoe, Nev, USA, December 2012.

[52] S. Englert, J. Gray, T. Kocher, and P. Shah, "A benchmark of NonStop SQL release 2 demonstrating near-linear speedup and scaleup on large databases," ACM SIGMETRICS Performance Evaluation Review, vol. 18, no. 1, pp. 245–246, 1990.

[53] X. Xu, J. Jager, and H.-P. Kriegel, "A fast parallel clustering algorithm for large spatial databases," Data Mining and Knowledge Discovery, vol. 3, no. 3, pp. 263–290, 1999.


10 Discrete Dynamics in Nature and Society

Figure 9: Process of the boundary identifying method. (a) Division of the coordinate system, (b) selection of border points, and (c) connection of border points.

different colors represents a cluster. Each cluster (e.g., Area A) has obvious traffic characteristics; for instance, the flow of people and automobiles in these areas is high, consistent with the real traffic map and traffic conditions of Beijing.

5.3. Boundary Identifying. On the ArcGIS platform, it is difficult to identify the borders of each cluster. However, the borders of each cluster must be connected in order to accurately form traffic subareas; we do this with our boundary identifying method, which is described in Figure 9. As illustrated in Figure 9, the boundary identifying method consists of the following three steps.

Step 1. As shown in Figure 9(a), we build a coordinate system, taking (0, 0) as the origin of coordinates, and divide it equally into n angular parts.

Step 2. We align each cluster center with the origin of coordinates and then map the other points of the same cluster into the coordinate system. Finally, the farthest point within each part is selected (e.g., P in Figure 9(b)).

Step 3. As depicted in Figure 9(c), we connect the selected points of all parts and thereby obtain a subarea.
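The three steps above amount to a polar-sector scan around each cluster center. The following is a minimal sketch of that idea in Python (our own illustration; the function and parameter names are not from the paper, and production use would work on projected GPS coordinates):

```python
import math

def boundary_points(center, points, n=36):
    """Select, for each of n equal angular sectors around the cluster
    center, the member point farthest from the center (Steps 1-2).
    Connecting the returned points in sector order yields the
    subarea border polygon (Step 3)."""
    farthest = {}  # sector index -> (distance, point)
    for (x, y) in points:
        dx, dy = x - center[0], y - center[1]  # translate center to origin
        if dx == 0 and dy == 0:
            continue
        angle = math.atan2(dy, dx) % (2 * math.pi)
        sector = int(angle / (2 * math.pi / n))
        dist = math.hypot(dx, dy)
        if sector not in farthest or dist > farthest[sector][0]:
            farthest[sector] = (dist, (x, y))
    # Border = farthest point of each nonempty sector, in angular order
    return [p for _, (_, p) in sorted(farthest.items())]
```

With n = 4, for example, `boundary_points((0, 0), [(2, 1), (4, 2), (-1, 2), (-3, -1), (1, -2)])` keeps the farther of the two collinear points in the first quadrant and one point per remaining quadrant.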

Using the presented boundary identifying method with the clustering results of Par3PKM, we plot the division results of the traffic subareas in Figure 10. As described in Figure 10, Area A is a typical traffic subarea, shown in the lower right corner of the graph, which includes Tsinghua University, Peking University, Renmin University of China, Beihang University, and Zhongguancun. Area B is composed of the Beijing Workers' Gymnasium, Blue Zoo Beijing, Children Outdoor Paopao Paradise, and so forth. Area C consists of Beijing North Railway Station, Beijing Exhibition Center, Capital Indoor Stadium, and so forth. Area D contains the Olympic Sports Center Stadium, Beijing Olympic Park, National Indoor Stadium, Beijing National Aquatics Center, Beijing International Convention Center, and so forth.

5.4. Analysis of Results. According to the division results, we observe that areas with similar traffic conditions are divided into the same traffic subarea, such as Tsinghua University and Peking University, or the Olympic Sports Center Stadium and the National Indoor Stadium. In contrast, the Blue Zoo Beijing and the Beijing Exhibition Center, as well as the Beijing North Railway Station and the Beijing International Convention Center, are classified into different traffic subareas. This is because the different regions within one traffic subarea have great similarities and correlations in traffic conditions, business patterns, and other aspects.

Based on the above analysis, we conclude that the proposed Par3PKM algorithm can efficiently cluster big trajectory data on a Hadoop cluster using MapReduce. Moreover, our boundary identifying method can accurately connect the borders of the clustering results for each cluster. In particular, the division results are consistent with the real traffic conditions of the corresponding areas in Beijing. Overall, the results demonstrate that Par3PKM, combined with DTSAD, is a promising alternative for traffic subarea division with large-scale taxi trajectories and thus can reduce the complexity of traffic planning, management, and analysis. More importantly, it can support decision-making for building ITSs.

6. Evaluation and Discussion

In this section, we evaluate the accuracy, efficiency, speedup, scale-up, and reliability of the Par3PKM algorithm via extensive experiments on real and synthetic data, and then discuss the performance results.

6.1. Evaluation Setup. The experimental platform is based on a Hadoop cluster composed of one master machine and eight slave machines, each with an Intel Xeon E7-4820 2.00 GHz CPU (4-core) and 8.00 GB RAM. All experiments are performed on Ubuntu 12.04 with Hadoop 1.0.4 and JDK 1.6.0.


Figure 10: Division results of traffic subarea (Areas A, B, C, and D).

Table 1: Data sets of the experimental evaluations.

Name                  Number of instances   Number of attributes
Iris                  150                   4
Haberman's Survival   306                   3
Ecoli                 336                   8
Hayes-Roth            160                   5
Lenses                24                    4
Wine                  178                   13

In addition to a real taxi trajectory data set (as described in Section 5.1), we use six synthetic data sets (shown in Table 1) selected from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets.html) to evaluate the performance of the Par3PKM algorithm in comparison with K-Means, Par2PK-Means, and ParCLARA, the latter being a MapReduce implementation of the best-known K-medoid algorithm, CLARA (Clustering Large Applications) [50]. Meanwhile, each data set is processed into 80 MB, 160 MB, 320 MB, 640 MB, 1280 MB, and 2560 MB versions so as to further verify the efficiency of the proposed algorithm. Also, we process the seven data sets into 160 MB, 320 MB, 480 MB, 640 MB, 800 MB, 960 MB, 1120 MB, and 1280 MB versions for validating the scale-up of our algorithm.
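One straightforward way to inflate a small UCI file to such target sizes is to replicate its records until a byte budget is met. This is only our assumption of how the enlarged copies could be produced; the paper does not give a script:

```python
def inflate(records, target_bytes):
    """Repeat the given records (text lines, cycling through them)
    until their total encoded size reaches target_bytes;
    returns the inflated list of lines."""
    out, size, i = [], 0, 0
    while size < target_bytes:
        line = records[i % len(records)]
        out.append(line)
        size += len(line.encode("utf-8")) + 1  # +1 for the newline
        i += 1
    return out
```

For instance, inflating a single 28-byte Iris record to a 100-byte budget yields four copies of that record.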

6.2. Evaluation on Efficiency. We perform efficiency experiments where we execute Par3PKM, Par2PK-Means, and ParCLARA in the parallel environment with eight nodes, and K-Means in the single-machine environment, using seven data sets of different sizes (varying from 80 MB to 2560 MB), in order to demonstrate whether the Par3PKM algorithm can process larger data sets with higher efficiency. The experimental results are shown in Table 2 and Figure 11, respectively.

As depicted in Figure 11, the K-Means algorithm cannot process data sets over 1280 MB in the single-machine environment on account of memory overflow, and thus the graph does not present the corresponding execution time of K-Means in this experiment. However, the Par3PKM algorithm can effectively handle more than 1280 MB, and even larger data, in the parallel environment (i.e., in a "Big Data" environment). In particular, the Par3PKM algorithm achieves higher efficiency than the ParCLARA and Par2PK-Means algorithms through the improvement and parallelization of the K-Means algorithm, such as the addition of the Combiner function in the Combine phase. To reduce the computational complexity of the MapReduce job and save the limited bandwidth available on a Hadoop cluster, the Combiner function of Par3PKM cuts down the amount of data shuffled between the Map tasks and the Reduce tasks: it runs on the Map output, and its output forms the input to the Reduce function.
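The effect of the Combiner in a MapReduce K-Means iteration can be sketched as follows. This is a simplified single-process simulation for illustration, not the paper's actual Hadoop implementation: the Map side emits (cluster id, (partial sum, count)) pairs, the Combiner pre-aggregates them per map task, and the Reduce side only averages the already-aggregated values.

```python
from collections import defaultdict

def map_phase(points, centers):
    """Emit (nearest-center id, (point vector, count 1)) per point."""
    out = []
    for p in points:
        cid = min(range(len(centers)),
                  key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])))
        out.append((cid, (list(p), 1)))
    return out

def combine_phase(pairs):
    """Pre-aggregate map output locally into one (sum, count) pair per
    cluster -- this is what shrinks the data shuffled to the reducers."""
    acc = defaultdict(lambda: None)
    for cid, (vec, cnt) in pairs:
        if acc[cid] is None:
            acc[cid] = (list(vec), cnt)
        else:
            s, c = acc[cid]
            acc[cid] = ([a + b for a, b in zip(s, vec)], c + cnt)
    return list(acc.items())

def reduce_phase(pairs):
    """Merge combiner outputs from all map tasks and emit new centers."""
    merged = combine_phase(pairs)  # same merge logic as the combiner
    return {cid: [x / cnt for x in s] for cid, (s, cnt) in merged}
```

With two map splits and two centers, the combiner reduces the shuffled records from one-per-point to at most one-per-cluster-per-split, while the reducer still recovers the exact new centers.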

At the same time, we find that the execution time of the K-Means algorithm is shorter than that of the Par3PKM algorithm when clustering small-scale data sets. The reason is that communication and interaction among the nodes consume a certain amount of time in the parallel environment (e.g., starting the Job and its Tasks, and communication between the NameNode and the DataNodes), so the total execution time is much longer than the actual computation time of the Par3PKM algorithm. More importantly, we also observe that the efficiency advantage of the Par3PKM algorithm multiplies, and its superiority becomes more marked, as the size of the data sets gradually increases.


Table 2: Execution time (s) comparison on seven data sets.

Data set              Size (MB)   K-Means   ParCLARA   Par2PK-Means   Par3PKM
Taxi Trajectory       80          75        662        439            312
                      160         103       756        486            371
                      320         195       893        579            406
                      640         1125      1173       838            614
                      1280        2373      1679       1348           1026
                      2560        —         2721       2312           1879
Iris                  80          32        612        301            213
                      160         54        693        380            279
                      320         116       812        440            306
                      640         685       1045       630            403
                      1280        1800      1248       1005           768
                      2560        —         2463       2013           1420
Haberman's Survival   80          56        576        311            296
                      160         60        675        400            324
                      320         130       823        470            378
                      640         720       987        670            426
                      1280        2010      1321       1200           873
                      2560        —         2449       2200           1719
Ecoli                 80          38        628        330            283
                      160         66        712        400            324
                      320         130       835        460            375
                      640         756       1104       700            482
                      1280        1912      1636       1234           763
                      2560        —         2479       2312           1416
Hayes-Roth            80          41        568        310            278
                      160         57        643        395            347
                      320         125       726        460            387
                      640         715       973        660            438
                      1280        1980      1479       1211           736
                      2560        —         2423       2120           1567
Lenses                80          32        624        309            297
                      160         59        701        389            327
                      320         130       924        452            376
                      640         700       1072       545            432
                      1280        1895      1378       1085           814
                      2560        —         2379       2089           1473
Wine                  80          50        635        350            317
                      160         78        705        420            356
                      320         130       835        470            402
                      640         730       1006       610            426
                      1280        2100      1346       1245           843
                      2560        —         2463       2240           1645

6.3. Evaluation on Accuracy. To evaluate the accuracy, we cluster different data sets via Par3PKM, Par2PK-Means, and ParCLARA on a Hadoop cluster with eight nodes, and via K-Means in the single-machine environment, respectively. We then plot the results in Figure 12(a).

The quality of the clustering is evaluated via the following error rate (ER) [51] equation:

ER = (O_m / O_t) × 100,  (8)


Figure 11: Efficiency comparison on different data sets. (a) Taxi Trajectory, (b) Iris, (c) Haberman's Survival, (d) Ecoli, (e) Hayes-Roth, (f) Lenses, and (g) Wine.

Figure 12: Accuracy and reliability. (a) Accuracy comparison of Par3PKM, Par2PK-Means, and ParCLARA on different data sets and (b) reliability of Par3PKM.

where O_m is the number of misclassified objects and O_t is the total number of objects. The lower the ER, the better the clustering.
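The error rate is straightforward to compute once each object carries a predicted and a reference label (an illustrative sketch; in practice, cluster labels must first be matched to reference classes, e.g., by majority vote per cluster):

```python
def error_rate(true_labels, predicted_labels):
    """ER = (misclassified objects / total objects) * 100, per Eq. (8)."""
    assert len(true_labels) == len(predicted_labels)
    mismatched = sum(t != p for t, p in zip(true_labels, predicted_labels))
    return 100.0 * mismatched / len(true_labels)
```

For example, one mismatch among four objects gives an ER of 25.0.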

As illustrated in Figure 12(a), in comparison with the other algorithms, the Par3PKM algorithm produces more accurate clustering results in most cases. The results indicate that the Par3PKM algorithm is valid and feasible.

6.4. Evaluation on Speedup. In order to evaluate the speedup of the Par3PKM algorithm, we keep the seven data sets constant (1280 MB) and increase the number of nodes (ranging from 1 node to 8 nodes) on a Hadoop cluster, and then plot the results in Figure 13(a). Moreover, we utilize the Iris data sets (1280 MB) to further verify the speedup of Par3PKM in comparison with Par2PK-Means and ParCLARA, and the results are illustrated in Figure 13(b).

The speedup metric [52, 53] is defined as

Speedup = T_s / T_p,  (9)

where T_s represents the execution time of an algorithm on one node (i.e., the sequential execution time) for clustering objects using the given data sets, and T_p denotes the execution time of an algorithm for solving the same problem using the same data sets on a Hadoop cluster with p nodes (i.e., the parallel execution time), respectively.

Figure 13: Speedup. (a) Speedup comparison of Par3PKM on different data sets and (b) speedup comparison of Par3PKM, Par2PK-Means, and ParCLARA.

As depicted in Figure 13(a), the speedup of the Par3PKM algorithm increases roughly linearly with the number of nodes. Linear speedup is known to be difficult to achieve because of communication costs and skew among the slaves. Furthermore, Figure 13(b) shows that Par3PKM has better speedup than Par2PK-Means and ParCLARA. The results demonstrate that the parallel Par3PKM algorithm has very good speedup performance, which is almost the same for data sets of very different sizes.

6.5. Evaluation on Scale-Up. To evaluate how well the Par3PKM algorithm processes larger data sets when more nodes are available, we perform scale-up experiments where we increase the size of the data sets (varying from 160 MB to 1280 MB) in direct proportion to the number of nodes (ranging from 1 node to 8 nodes), and then plot the results in Figure 14(a). Furthermore, with the Iris data sets (varying from 160 MB to 1280 MB), the scale-up comparison of Par3PKM, Par2PK-Means, and ParCLARA is depicted in Figure 14(b).

The scale-up metric [52, 53] is given by

Scale-up = T_s / T_p,  (10)

where T_s is the execution time of an algorithm for processing the given data sets on one node and T_p is the execution time of an algorithm for handling p-times larger data sets on p-times more nodes.
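Both metrics (9) and (10) are simple ratios of a one-node baseline to a parallel run; only the workload in the denominator differs. As a sketch (the timing values below are illustrative placeholders, not measurements from our experiments):

```python
def speedup(t_serial, t_parallel):
    """Speedup = T_s / T_p, Eq. (9): same data set, one node vs p nodes."""
    return t_serial / t_parallel

def scaleup(t_one_node, t_p_nodes):
    """Scale-up, Eq. (10): base data set on one node vs a p-times larger
    data set on p-times more nodes; values near 1 mean the algorithm
    absorbs proportional growth with near-constant wall time."""
    return t_one_node / t_p_nodes
```

For example, an 800 s serial run finishing in 100 s on 8 nodes gives a speedup of 8.0, and a 300 s single-node base run against a 320 s run on the proportionally larger problem gives a scale-up of 0.9375.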

As illustrated in Figure 14(a), the scale-up values of Par3PKM stay in the vicinity of 1, or even below it, under the proportional growth of both the number of nodes and the size of the data sets. Moreover, Figure 14(b) shows that Par3PKM has better scale-up than Par2PK-Means and ParCLARA. The results indicate that the Par3PKM algorithm has excellent scale-up and adaptability to large-scale data sets on a Hadoop cluster with MapReduce.

6.6. Evaluation on Reliability. To evaluate the reliability of the Par3PKM algorithm, we shut down several nodes (ranging from 1 node to 7 nodes) to test whether Par3PKM can still execute normally and achieve the same clustering results on the 1280 MB Iris data set, and then plot the results in Figure 12(b).

As illustrated in Figure 12(b), although the execution time of the Par3PKM algorithm increases gradually with the number of faulty nodes, the algorithm still executes normally and produces the same results. This shows that the Par3PKM algorithm has good reliability in a "Big Data" environment, owing to the high fault tolerance of the MapReduce framework on a Hadoop platform: when a node cannot execute its tasks on a Hadoop cluster, the JobTracker automatically reassigns the tasks of the faulty node(s) to other spare nodes. In contrast, when the serial K-Means algorithm runs on a single machine and that machine fails, the entire computational task fails.
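This fault-tolerance contrast can be illustrated with a toy scheduler (a conceptual sketch only; Hadoop's JobTracker uses heartbeat-driven, locality-aware scheduling, not this simplified round-robin):

```python
def run_with_failures(tasks, nodes, failed):
    """Assign tasks round-robin over the healthy nodes only. Tasks that
    would have landed on failed nodes are simply scheduled elsewhere,
    so the job completes (more slowly) as long as one node survives;
    with no healthy nodes left, the whole job fails."""
    healthy = [n for n in nodes if n not in failed]
    if not healthy:
        raise RuntimeError("job fails: no healthy nodes")
    assignment = {}
    for i, task in enumerate(tasks):
        assignment[task] = healthy[i % len(healthy)]
    return assignment
```

Killing one node of three still yields a complete assignment over the two survivors, mirroring the behavior observed in Figure 12(b); killing the only node mirrors the serial single-machine failure case.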

In summary, extensive experiments are conducted on real and synthetic data, and the performance results demonstrate that the proposed Par3PKM algorithm is much more efficient and accurate, with better speedup, scale-up, and reliability.


Figure 14: Scale-up. (a) Scale-up comparison of Par3PKM on different data sets and (b) scale-up comparison of Par3PKM, Par2PK-Means, and ParCLARA.

7. Conclusions

In this paper, we have proposed an efficient MapReduce-based parallel clustering algorithm, named Par3PKM, to solve the traffic subarea division problem with large-scale taxi trajectories. In Par3PKM, the distance metric and initialization strategy of K-Means are optimized in order to enhance the accuracy of clustering. Then, to improve the efficiency and scalability of Par3PKM, the optimized K-Means algorithm is implemented in a MapReduce framework on Hadoop. The optimization and parallelism of Par3PKM reduce memory consumption and the computational cost of big calculations, thereby significantly improving the accuracy, efficiency, scalability, and reliability of traffic subarea division. Our performance evaluation indicates that the proposed algorithm can efficiently cluster a large number of GPS trajectories of taxicabs and, in particular, achieves more accurate results than K-Means, Par2PK-Means, and ParCLARA, with favorably superior performance. Furthermore, based on Par3PKM, we have presented a distributed traffic subarea division method, named DTSAD, which is performed on a Hadoop distributed computing platform with the MapReduce parallel processing paradigm. In DTSAD, the boundary identifying method can effectively connect the borders of the clustering results. Most importantly, we have divided the traffic subareas of Beijing using big real-world taxi trajectory data sets through the presented method, and our case study demonstrates that our approach can accurately and efficiently divide traffic subareas.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Authors' Contribution

Dawen Xia and Binfeng Wang contributed equally to this work.

Acknowledgments

The authors would like to thank the academic editor and the anonymous reviewers for their valuable comments and suggestions. This work was partially supported by the National Natural Science Foundation of China (Grant no. 61402380), the Scientific Project of the State Ethnic Affairs Commission of the People's Republic of China (Grant no. 14GZZ012), the Science and Technology Foundation of Guizhou (Grant no. LH20147386), and the Fundamental Research Funds for the Central Universities (Grants nos. XDJK2015B030 and XDJK2015D029).

References

[1] Y. Qi and S. Ishak, "Stochastic approach for short-term freeway traffic prediction during peak periods," IEEE Transactions on Intelligent Transportation Systems, vol. 14, no. 2, pp. 660–672, 2013.

[2] A. de Palma and R. Lindsey, "Traffic congestion pricing methodologies and technologies," Transportation Research Part C: Emerging Technologies, vol. 19, no. 6, pp. 1377–1399, 2011.

[3] V. Marx, "The big challenges of big data," Nature, vol. 498, no. 7453, pp. 255–260, 2013.

[4] D. Agrawal, P. Bernstein, E. Bertino et al., "Challenges and opportunities with big data: a community white paper developed by leading researchers across the United States," White Paper, 2012.


[5] "Special online collection: dealing with data," Science, vol. 331, no. 6018, pp. 639–806, 2011.

[6] "Big data: science in the petabyte era," Nature, vol. 455, no. 7209, pp. 1–136, 2008.

[7] R. R. Weiss and L. Zgorski, Obama Administration Unveils 'Big Data' Initiative: Announces $200 Million in New R&D Investments, Office of Science and Technology Policy, Executive Office of the President, 2012.

[8] J. Manyika, M. Chui, B. Brown et al., "Big data: the next frontier for innovation, competition, and productivity," Tech. Rep., McKinsey Global Institute, 2011.

[9] Y. Genovese and S. Prentice, "Pattern-based strategy: getting value from big data," Gartner Special Report, 2011.

[10] N. J. Yuan, Y. Zheng, L. Zhang, and X. Xie, "T-finder: a recommender system for finding passengers and vacant taxis," IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 10, pp. 2390–2403, 2013.

[11] J. Yuan, Y. Zheng, C. Zhang et al., "T-drive: driving directions based on taxi trajectories," in Proceedings of the 18th International Conference on Advances in Geographic Information Systems (ACM SIGSPATIAL GIS '10), pp. 99–108, ACM, San Jose, Calif, USA, November 2010.

[12] J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, Waltham, Mass, USA, 3rd edition, 2011.

[13] D. Pelleg and A. Moore, "X-means: extending K-means with efficient estimation of the number of clusters," in Proceedings of the 17th International Conference on Machine Learning (ICML '00), pp. 727–734, Stanford, Calif, USA, June 2000.

[14] S. Kantabutra and A. L. Couch, "Parallel K-means clustering algorithm on NOWs," NECTEC Technical Journal, vol. 1, no. 6, pp. 243–247, 2000.

[15] Y. Zhang, Z. Xiong, J. Mao, and L. Ou, "The study of parallel K-means algorithm," in Proceedings of the 6th World Congress on Intelligent Control and Automation (WCICA '06), pp. 5868–5871, IEEE, Dalian, China, June 2006.

[16] P. Kraj, A. Sharma, N. Garge, R. Podolsky, and R. A. McIndoe, "ParaKMeans: implementation of a parallelized K-means algorithm suitable for general laboratory use," BMC Bioinformatics, vol. 9, article 200, 13 pages, 2008.

[17] M. K. Pakhira, "Clustering large databases in distributed environment," in Proceedings of the IEEE International Advance Computing Conference (IACC '09), pp. 351–358, Patiala, India, March 2009.

[18] K. J. Kohlhoff, V. S. Pande, and R. B. Altman, "K-means for parallel architectures using all-prefix-sum sorting and updating steps," IEEE Transactions on Parallel and Distributed Systems, vol. 24, no. 8, pp. 1602–1612, 2013.

[19] C.-T. Chu, S. K. Kim, Y.-A. Lin et al., "Map-reduce for machine learning on multicore," in Proceedings of the 20th Annual Conference on Neural Information Processing Systems (NIPS '06), pp. 281–288, Vancouver, Canada, 2006.

[20] W. Zhao, H. Ma, and Q. He, "Parallel K-means clustering based on MapReduce," in Proceedings of the 1st International Conference on Cloud Computing (CloudCom '09), pp. 674–679, Beijing, China, December 2009.

[21] P. Zhou, J. Lei, and W. Ye, "Large-scale data sets clustering based on MapReduce and Hadoop," Journal of Computational Information Systems, vol. 7, no. 16, pp. 5956–5963, 2011.

[22] C. D. Nguyen, D. T. Nguyen, and V.-H. Pham, "Parallel two-phase K-means," in Proceedings of the 13th International Conference on Computational Science and Its Applications (ICCSA '13), pp. 24–27, Ho Chi Minh City, Vietnam, June 2013.

[23] R. J. Walinchus, "Real-time network decomposition and subnetwork interfacing," Tech. Rep. HS-011 999, Highway Research Record, 1971.

[24] S. C. Wong, W. T. Wong, C. M. Leung, and C. O. Tong, "Group-based optimization of a time-dependent TRANSYT traffic model for area traffic control," Transportation Research Part B: Methodological, vol. 36, no. 4, pp. 291–312, 2002.

[25] D. I. Robertson and R. D. Bretherton, "Optimizing networks of traffic signals in real time: the SCOOT method," IEEE Transactions on Vehicular Technology, vol. 40, no. 1, pp. 11–15, 1991.

[26] Y.-Y. Ma and X.-G. Yang, "Traffic sub-area division expert system for urban traffic control," in Proceedings of the International Conference on Intelligent Computation Technology and Automation (ICICTA '08), pp. 589–593, Hunan, China, October 2008.

[27] K. Lu, J.-M. Xu, S.-J. Zheng, and S.-M. Wang, "Research on fast dynamic division method of coordinated control subarea," Acta Automatica Sinica, vol. 38, no. 2, pp. 279–287, 2012.

[28] K. Lu, J.-M. Xu, and S.-J. Zheng, "Correlation degree analysis of neighboring intersections and its application," Journal of South China University of Technology, vol. 37, no. 11, pp. 37–42, 2009.

[29] H. Guo, J. Cheng, Q. Peng, C. Zhu, and Y. Mu, "Dynamic division of traffic control sub-area methods based on the similarity of adjacent intersections," in Proceedings of the IEEE 17th International Conference on Intelligent Transportation Systems (ITSC '14), pp. 2208–2213, Qingdao, China, October 2014.

[30] C. Li, Y. Xie, H. Zhang, and X.-L. Yan, "Dynamic division about traffic control sub-area based on back propagation neural network," in Proceedings of the 2nd International Conference on Intelligent Human-Machine Systems and Cybernetics (IHMSC '10), pp. 22–25, Nanjing, China, August 2010.

[31] Z. Zhou, S. Lin, and Y. Xi, "A fast network partition method for large-scale urban traffic networks," Journal of Control Theory and Applications, vol. 11, no. 3, pp. 359–366, 2013.

[32] J. MacQueen, "Some methods for classification and analysis of multivariate observations," in Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, pp. 281–297, University of California Press, Berkeley, Calif, USA, 1967.

[33] A. K. Jain, "Data clustering: 50 years beyond K-means," Pattern Recognition Letters, vol. 31, no. 8, pp. 651–666, 2010.

[34] X. Wu, V. Kumar, J. R. Quinlan et al., "Top 10 algorithms in data mining," Knowledge and Information Systems, vol. 14, no. 1, pp. 1–37, 2008.

[35] W. Zou, Y. Zhu, H. Chen, and X. Sui, "A clustering approach using cooperative artificial bee colony algorithm," Discrete Dynamics in Nature and Society, vol. 2010, Article ID 459796, 16 pages, 2010.

[36] D. T. Pham, S. S. Dimov, and C. D. Nguyen, "An incremental K-means algorithm," Proceedings of the Institution of Mechanical Engineers, Part C: Journal of Mechanical Engineering Science, vol. 218, no. 7, pp. 783–794, 2004.

[37] D. T. Pham, S. S. Dimov, and C. D. Nguyen, "A two-phase K-means algorithm for large datasets," Journal of Mechanical Engineering Science, vol. 218, no. 10, pp. 1269–1273, 2004.

[38] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, "The Hadoop distributed file system," in Proceedings of the 26th Symposium on Mass Storage Systems and Technologies (MSST '10), pp. 1–10, IEEE, Incline Village, Nev, USA, May 2010.

[39] A. Ene, S. Im, and B. Moseley, "Fast clustering using MapReduce," in Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '11), pp. 681–689, San Diego, Calif, USA, August 2011.

[40] T. White, Hadoop: The Definitive Guide, O'Reilly Media, Sebastopol, Calif, USA, 3rd edition, 2012.

[41] D. Xia, Z. Rong, Y. Zhou, Y. Li, Y. Shen, and Z. Zhang, "A novel parallel algorithm for frequent itemsets mining in massive small files datasets," ICIC Express Letters, Part B: Applications, vol. 5, no. 2, pp. 459–466, 2014.

[42] D. Xia, Z. Rong, Y. Zhou, B. Wang, Y. Li, and Z. Zhang, "Discovery and analysis of usage data based on Hadoop for personalized information access," in Proceedings of the IEEE 16th International Conference on Computational Science and Engineering—Big Data Science and Engineering (CSE-BDSE '13), pp. 917–924, IEEE, Sydney, Australia, December 2013.

[43] D. Xia, B. Wang, Z. Rong, Y. Li, and Z. Zhang, "Effective methods and strategies for massive small files processing based on Hadoop," ICIC Express Letters, vol. 8, no. 7, pp. 1935–1941, 2014.

[44] S. Ghemawat, H. Gobioff, and S.-T. Leung, "The Google file system," in Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP '03), pp. 29–43, Bolton Landing, NY, USA, October 2003.

[45] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.

[46] P. Zikopoulos, C. Eaton, D. deRoos, T. Deutsch, and G. Lapis, Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data, McGraw-Hill, New York, NY, USA, 2011.

[47] W. K. D. Pun and A. B. M. S. Ali, "Unique distance measure approach for K-means (UDMA-Km) clustering algorithm," in Proceedings of the IEEE Region 10 Conference (TENCON '07), pp. 1–4, IEEE, Taipei, Taiwan, November 2007.

[48] A. M. Fahim, A. M. Salem, F. A. Torkey, and M. A. Ramadan, "An efficient enhanced K-means clustering algorithm," Journal of Zhejiang University SCIENCE A, vol. 7, no. 10, pp. 1626–1633, 2006.

[49] M. Zhu, Data Mining, University of Science and Technology of China Press, 2002.

[50] L. Kaufman and P. J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, John Wiley & Sons, 1990.

[51] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proceedings of the 26th Annual Conference on Neural Information Processing Systems (NIPS '12), pp. 1097–1105, Lake Tahoe, Nev, USA, December 2012.

[52] S. Englert, J. Gray, T. Kocher, and P. Shah, "A benchmark of NonStop SQL release 2 demonstrating near-linear speedup and scaleup on large databases," ACM SIGMETRICS Performance Evaluation Review, vol. 18, no. 1, pp. 245–246, 1990.

[53] X. Xu, J. Jäger, and H.-P. Kriegel, "A fast parallel clustering algorithm for large spatial databases," Data Mining and Knowledge Discovery, vol. 3, no. 3, pp. 263–290, 1999.


[Figure omitted: map of the division results, showing Areas A, B, C, and D.]

Figure 10: Division results of traffic subarea.

Table 1: Data sets of the experimental evaluations.

Name                  Number of instances   Number of attributes
Iris                  150                   4
Haberman's Survival   306                   3
Ecoli                 336                   8
Hayes-Roth            160                   5
Lenses                24                    4
Wine                  178                   13

In addition to a real taxi trajectory data set (as described in Section 5.1), we use six synthetic data sets (as shown in Table 1) selected from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets.html) to evaluate the performance of the Par3PKM algorithm in comparison with K-Means, Par2PK-Means, and ParCLARA, which is the best-known K-medoid algorithm, CLARA (Clustering Large Applications) [50], with a MapReduce implementation. Meanwhile, each data set is processed into 80 MB, 160 MB, 320 MB, 640 MB, 1280 MB, and 2560 MB so as to further verify the efficiency of the proposed algorithm. Also, we handle the seven data sets into 160 MB, 320 MB, 480 MB, 640 MB, 800 MB, 960 MB, 1120 MB, and 1280 MB for validating the scale-up of our algorithm.
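The paper does not state how the small UCI files were grown to these sizes; a minimal sketch under that assumption (the function name and file names are illustrative, not from the paper) simply replicates each file's records until the target size is reached:

```python
import os


def inflate_dataset(src_path: str, dst_path: str, target_mb: int) -> None:
    """Replicate the records of a small data set file until the output
    reaches roughly target_mb megabytes (illustrative sketch only)."""
    target_bytes = target_mb * 1024 * 1024
    with open(src_path, "rb") as src:
        block = src.read()  # the whole small UCI file
    if not block:
        raise ValueError("source file is empty")
    with open(dst_path, "wb") as dst:
        written = 0
        while written < target_bytes:
            dst.write(block)
            written += len(block)


# e.g. inflate_dataset("iris.data", "iris_80mb.data", 80)
```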

6.2. Evaluation on Efficiency. We perform efficiency experiments in which we execute Par3PKM, Par2PK-Means, and ParCLARA in the parallel environment with eight nodes, and K-Means in the single-machine environment, using seven data sets of different sizes (varying from 80 MB to 2560 MB), in order to demonstrate whether the Par3PKM algorithm can process larger data sets with higher efficiency. The experimental results are shown in Table 2 and Figure 11, respectively.

As depicted in Figure 11, the K-Means algorithm cannot process data sets over 1280 MB in the single-machine environment on account of memory overflow, and thus the graph does not present the corresponding execution time of K-Means in this experiment. However, the Par3PKM algorithm can effectively handle data sets of more than 1280 MB, and even larger data, in the parallel environment (i.e., in a "Big Data" environment). In particular, the Par3PKM algorithm achieves higher efficiency than the ParCLARA and Par2PK-Means algorithms owing to the improvement and parallelism of the K-Means algorithm, such as the addition of the Combiner function in the Combine phase. To reduce the computational complexity of a MapReduce job and save the limited bandwidth available on a Hadoop cluster, the Combiner function of the Par3PKM algorithm is employed to cut down the amount of data shuffled between the Map tasks and the Reduce tasks; it is specified to run on the Map output, and its output forms the input to the Reduce function.
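The shuffle-saving role of the Combiner can be illustrated with a small, self-contained sketch of one K-Means iteration (plain Python here as our illustration, not the authors' Hadoop implementation): the map step emits one (center id, point) pair per point, the combine step collapses each map task's output to per-center partial sums and counts, and the reduce step merges those partial sums into new centers, so only one small record per center, rather than one per point, crosses the network.

```python
from collections import defaultdict


def nearest(point, centers):
    # index of the closest center under squared Euclidean distance
    return min(range(len(centers)),
               key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centers[i])))


def map_phase(points, centers):
    # one (center_id, point) pair per input point
    return [(nearest(p, centers), p) for p in points]


def combine_phase(mapped):
    # per-center partial sums and counts: this is what shrinks the shuffle
    partial = defaultdict(lambda: (None, 0))
    for cid, p in mapped:
        s, n = partial[cid]
        partial[cid] = ([a + b for a, b in zip(s, p)] if s else list(p), n + 1)
    return partial


def reduce_phase(partials):
    # merge the partial sums from all map tasks into new centers
    total = defaultdict(lambda: (None, 0))
    for partial in partials:
        for cid, (s, n) in partial.items():
            ts, tn = total[cid]
            total[cid] = ([a + b for a, b in zip(ts, s)] if ts else s, tn + n)
    return {cid: [x / n for x in s] for cid, (s, n) in total.items()}


# one iteration over two "map task" splits:
centers = [(0.0, 0.0), (10.0, 10.0)]
splits = [[(0.0, 1.0), (1.0, 0.0)], [(9.0, 10.0), (11.0, 10.0)]]
new_centers = reduce_phase([combine_phase(map_phase(s, centers)) for s in splits])
# new_centers: {0: [0.5, 0.5], 1: [10.0, 10.0]}
```

Each split contributes at most one (sum, count) record per center, regardless of how many points it holds, which is exactly the bandwidth saving attributed to the Combiner above.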

At the same time, we can find that the execution time of the K-Means algorithm is shorter than that of the Par3PKM algorithm when clustering small-scale data sets. The reason is that the communication and interaction of each node consume a certain amount of time in the parallel environment (e.g., the start of Job and Task tasks, and the communication between the NameNode and DataNodes), thereby leading to an execution time much longer than the actual computation time of the Par3PKM algorithm. More importantly, we can also observe that the efficiency of the Par3PKM algorithm improves multiply, and its superiority becomes more marked, as the size of the data sets gradually increases.


Table 2: Execution time comparison on seven data sets. Execution time in seconds; "—" indicates that the serial K-Means could not process the 2560 MB data set.

Data set              Size (MB)   K-Means   ParCLARA   Par2PK-Means   Par3PKM
Taxi Trajectory       80          75        662        439            312
                      160         103       756        486            371
                      320         195       893        579            406
                      640         1125      1173       838            614
                      1280        2373      1679       1348           1026
                      2560        —         2721       2312           1879
Iris                  80          32        612        301            213
                      160         54        693        380            279
                      320         116       812        440            306
                      640         685       1045       630            403
                      1280        1800      1248       1005           768
                      2560        —         2463       2013           1420
Haberman's Survival   80          56        576        311            296
                      160         60        675        400            324
                      320         130       823        470            378
                      640         720       987        670            426
                      1280        2010      1321       1200           873
                      2560        —         2449       2200           1719
Ecoli                 80          38        628        330            283
                      160         66        712        400            324
                      320         130       835        460            375
                      640         756       1104       700            482
                      1280        1912      1636       1234           763
                      2560        —         2479       2312           1416
Hayes-Roth            80          41        568        310            278
                      160         57        643        395            347
                      320         125       726        460            387
                      640         715       973        660            438
                      1280        1980      1479       1211           736
                      2560        —         2423       2120           1567
Lenses                80          32        624        309            297
                      160         59        701        389            327
                      320         130       924        452            376
                      640         700       1072       545            432
                      1280        1895      1378       1085           814
                      2560        —         2379       2089           1473
Wine                  80          50        635        350            317
                      160         78        705        420            356
                      320         130       835        470            402
                      640         730       1006       610            426
                      1280        2100      1346       1245           843
                      2560        —         2463       2240           1645

6.3. Evaluation on Accuracy. To evaluate the accuracy, we cluster different data sets via Par3PKM, Par2PK-Means, and ParCLARA based on a Hadoop cluster with eight nodes, and through K-Means in the single-machine environment, respectively. We then plot the results in Figure 12(a).

The quality of the algorithm is evaluated via the following error rate (ER) [51] equation:

ER = (O_m / O_t) × 100, (8)

[Figure omitted: seven line charts of execution time (s) versus data set size (80–2560 MB) for K-Means, ParCLARA, Par2PK-Means, and Par3PKM.]

Figure 11: Efficiency comparison on different data sets: (a) Taxi Trajectory, (b) Iris, (c) Haberman's Survival, (d) Ecoli, (e) Hayes-Roth, (f) Lenses, and (g) Wine.

[Figure omitted: (a) error rate (%) of K-Means, ParCLARA, Par2PK-Means, and Par3PKM on the Iris, Haberman's Survival, Ecoli, Lenses, Hayes-Roth, and Wine data sets; (b) execution time (s) versus number of faulty nodes (0–7).]

Figure 12: Accuracy and reliability: (a) accuracy comparison of Par3PKM, Par2PK-Means, and ParCLARA on different data sets and (b) reliability of Par3PKM.

where O_m is the number of misclassified objects and O_t is the total number of objects. The lower the ER, the better the clustering.
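Equation (8) translates directly into code; the small helper below (the function name is ours) returns the percentage given the misclassification count:

```python
def error_rate(misclassified: int, total: int) -> float:
    """Error rate ER = (O_m / O_t) * 100 from equation (8); lower is better."""
    if total <= 0:
        raise ValueError("total number of objects must be positive")
    return misclassified * 100.0 / total


# e.g. 12 misclassified objects out of 150 gives ER = 8.0
```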

As illustrated in Figure 12(a), in comparison with the other algorithms, the Par3PKM algorithm produces more accurate clustering results in most cases. The results indicate that the Par3PKM algorithm is valid and feasible.

6.4. Evaluation on Speedup. In order to evaluate the speedup of the Par3PKM algorithm, we keep the seven data sets constant (1280 MB) and increase the number of nodes (ranging from 1 node to 8 nodes) on a Hadoop cluster, and then plot the results in Figure 13(a). Moreover, we utilize the Iris data set (1280 MB) to further verify the speedup of Par3PKM in comparison with Par2PK-Means and ParCLARA, and the results are illustrated in Figure 13(b).

The speedup metric [52, 53] is defined as

Speedup = T_s / T_p, (9)

where T_s represents the execution time of an algorithm on one node (i.e., the sequential execution time) for clustering objects using the given data sets, and T_p denotes the execution time of an algorithm for solving the same problem using the same data sets on a Hadoop cluster with p nodes (i.e., the parallel execution time), respectively.

[Figure omitted: two line charts of speedup versus number of nodes (1–8) against the linear baseline.]

Figure 13: Speedup: (a) speedup comparison of Par3PKM on different data sets (Taxi Trajectory, Iris, Haberman's Survival, Ecoli, Hayes-Roth, Lenses, and Wine) and (b) speedup comparison of Par3PKM, Par2PK-Means, and ParCLARA.

As depicted in Figure 13(a), the speedup of the Par3PKM algorithm increases relatively linearly with an increasing number of nodes. It is known that linear speedup is difficult to achieve because of the communication cost and the skew of the slaves. Furthermore, Figure 13(b) shows that Par3PKM has better speedup than Par2PK-Means and ParCLARA. The results demonstrate that the parallel algorithm Par3PKM has very good speedup performance, which is almost the same for data sets with very different sizes.

6.5. Evaluation on Scale-Up. To evaluate how well the Par3PKM algorithm processes larger data sets when more nodes are available, we perform scale-up experiments in which we increase the size of the data sets (varying from 160 MB to 1280 MB) in direct proportion to the number of nodes (ranging from 1 node to 8 nodes), and then plot the results in Figure 14(a). Furthermore, with the Iris data sets (varying from 160 MB to 1280 MB), the scale-up comparison of Par3PKM, Par2PK-Means, and ParCLARA is depicted in Figure 14(b).

The scale-up metric [52, 53] is given by

Scale-up = T_s / T_p, (10)

where T_s is the execution time of an algorithm for processing the given data sets on one node, and T_p is the execution time of an algorithm for handling p-times larger data sets on p-times larger nodes.

As illustrated in Figure 14(a), the scale-up values of Par3PKM are in the vicinity of 1, or even less, with the proportional growth of both the number of nodes and the size of the data sets. Moreover, Figure 14(b) shows that Par3PKM has better scale-up than Par2PK-Means and ParCLARA. The results indicate that the Par3PKM algorithm has excellent scale-up and adaptability on large-scale data sets based on a Hadoop cluster with MapReduce.
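Both metrics, equations (9) and (10), reduce to a ratio of a baseline time to a measured time; two small helpers (ours, with illustrative timings rather than numbers taken from the experiments) make the computation explicit:

```python
def speedup(t_serial: float, t_parallel: float) -> float:
    """Speedup = T_s / T_p (equation (9)): single-node time over the time
    on a p-node cluster for the same data set; the linear bound is p."""
    return t_serial / t_parallel


def scale_up(t_base: float, t_scaled: float) -> float:
    """Scale-up = T_s / T_p (equation (10)): time on one node over the time
    for p-times larger data on p-times more nodes; values near 1 indicate
    good scalability."""
    return t_base / t_scaled


# illustrative: 1 node takes 800 s; 8 nodes take 130 s on the same data
s = speedup(800.0, 130.0)   # about 6.2, below the linear bound of 8
# illustrative: 8x data on 8 nodes takes 890 s versus the 800 s baseline
u = scale_up(800.0, 890.0)  # about 0.9
```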

6.6. Evaluation on Reliability. To evaluate the reliability of the Par3PKM algorithm, we shut down several nodes (ranging from 1 node to 7 nodes) to demonstrate whether Par3PKM can still execute normally and achieve the same clustering results from the Iris data set with 1280 MB, and then plot the results in Figure 12(b).

As illustrated in Figure 12(b), although the execution time of the Par3PKM algorithm increases gradually with the growth of the number of faulty nodes, the Par3PKM algorithm still executes normally and produces the same results. The results show that the Par3PKM algorithm has good reliability in a "Big Data" environment, due to the high fault tolerance of the MapReduce framework on a Hadoop platform: when a node cannot execute its tasks on a Hadoop cluster, the JobTracker automatically reassigns the tasks of the faulty node(s) to other spare nodes. Conversely, the serial K-Means algorithm on a single machine cannot execute normally when the machine is faulty, in which case the entire computational task fails.

In summary, extensive experiments are conducted on real and synthetic data, and the performance results demonstrate that the proposed Par3PKM algorithm is much more efficient and accurate, with better speedup, scale-up, and reliability.


[38] K Shvachko H Kuang S Radia and R Chansler ldquoTheHadoopdistributed file systemrdquo in Proceedings of the 26th Symposium

18 Discrete Dynamics in Nature and Society

on Mass Storage Systems and Technologies (MSST rsquo10) pp 1ndash10IEEE Incline Village Nev USA May 2010

[39] A Ene S Im and B Moseley ldquoFast clustering using MapRe-ducerdquo in Proceedings of the 17th ACM SIGKDD InternationalConference on Knowledge Discovery and DataMining (KDD rsquo11)pp 681ndash689 San Diego Calif USA August 2011

[40] T White Hadoop The Definitive Guide OrsquoReilly MediaSebastopol Calif USA 3rd edition 2012

[41] D Xia Z Rong Y Zhou Y Li Y Shen and Z Zhang ldquoA novelparallel algorithm for frequent itemsetsmining inmassive smallfiles datasetsrdquo ICIC Express Letters Part B Applications vol 5no 2 pp 459ndash466 2014

[42] D Xia Z Rong Y Zhou B Wang Y Li and Z ZhangldquoDiscovery and analysis of usage data based on Hadoop forpersonalized information accessrdquo in Proceedings of the IEEE16th International Conference on Computational Science andEngineeringmdashBig Data Science and Engineering (CSE-BDSE rsquo13)pp 917ndash924 IEEE Sydney Australia December 2013

[43] D Xia B Wang Z Rong Y Li and Z Zhang ldquoEffectivemethods and strategies for massive small files processing basedon Hadooprdquo ICIC Express Letters vol 8 no 7 pp 1935ndash19412014

[44] S Ghemawat H Gobioff and S-T Leung ldquoThe google file sys-temrdquo in Proceedings of the 19th ACM Symposium on OperatingSystems Principles (SOSP rsquo03) pp 29ndash43 Bolton Landing NYUSA October 2003

[45] J Dean and SGhemawat ldquoMapReduce simplified data process-ing on large clustersrdquo Communications of the ACM vol 51 no1 pp 107ndash113 2008

[46] P Zikopoulos C Eaton D deRoos T Deutsch and G LapisUnderstanding Big Data Analytics for Enterprise Class Hadoopand Streaming Data McGraw-Hill New York NY USA 2011

[47] W K D Pun and A B M S Ali ldquoUnique distance measureapproach for K-means (UDMA-Km) clustering algorithmrdquo inProceedings of the IEEE Region 10 Conference (TENCON rsquo07)pp 1ndash4 IEEE Taipei Taiwan November 2007

[48] A M Fahim A M Salem F A Torkey and M A RamadanldquoAn efficient enhanced K-means clustering algorithmrdquo Journalof Zhejiang University SCIENCE A vol 7 no 10 pp 1626ndash16332006

[49] M Zhu Data Mining University of Science and Technology ofChina Press 2002

[50] L Kaufman and P J Rousseeuw Finding Groups in Data AnIntroduction to Cluster Analysis John Wiley amp Sons 1990

[51] A Krizhevsky I Sutskever andG EHinton ldquoImagenet classifi-cation with deep convolutional neural networksrdquo in Proceedingsof the 26th Annual Conference on Neural Information ProcessingSystems (NIPS rsquo12) pp 1097ndash1105 Lake Tahoe Nev USADecember 2012

[52] S Englert J Gray T Kocher and P Shah ldquoA benchmark ofNonStop SQL release 2 demonstrating near-linear speedup andscaleup on large databasesrdquo ACM SIGMETRICS PerformanceEvaluation Review vol 18 no 1 pp 245ndash246 1990

[53] X Xu J Jager and H-P Kriegel ldquoA fast parallel clustering algo-rithm for large spatial databasesrdquo Data Mining and KnowledgeDiscovery vol 3 no 3 pp 263ndash290 1999


12 Discrete Dynamics in Nature and Society

Table 2: Execution time comparison on seven data sets (execution time in seconds; — indicates no result for K-Means at 2560 MB).

Data set              Size (MB)   K-Means   ParCLARA   Par2PK-Means   Par3PKM
Taxi Trajectory          80           75       662         439           312
                        160          103       756         486           371
                        320          195       893         579           406
                        640         1125      1173         838           614
                       1280         2373      1679        1348          1026
                       2560            —      2721        2312          1879
Iris                     80           32       612         301           213
                        160           54       693         380           279
                        320          116       812         440           306
                        640          685      1045         630           403
                       1280         1800      1248        1005           768
                       2560            —      2463        2013          1420
Haberman's Survival      80           56       576         311           296
                        160           60       675         400           324
                        320          130       823         470           378
                        640          720       987         670           426
                       1280         2010      1321        1200           873
                       2560            —      2449        2200          1719
Ecoli                    80           38       628         330           283
                        160           66       712         400           324
                        320          130       835         460           375
                        640          756      1104         700           482
                       1280         1912      1636        1234           763
                       2560            —      2479        2312          1416
Hayes-Roth               80           41       568         310           278
                        160           57       643         395           347
                        320          125       726         460           387
                        640          715       973         660           438
                       1280         1980      1479        1211           736
                       2560            —      2423        2120          1567
Lenses                   80           32       624         309           297
                        160           59       701         389           327
                        320          130       924         452           376
                        640          700      1072         545           432
                       1280         1895      1378        1085           814
                       2560            —      2379        2089          1473
Wine                     80           50       635         350           317
                        160           78       705         420           356
                        320          130       835         470           402
                        640          730      1006         610           426
                       1280         2100      1346        1245           843
                       2560            —      2463        2240          1645

6.3. Evaluation on Accuracy. To evaluate the accuracy, we cluster different data sets via Par3PKM, Par2PK-Means, and ParCLARA on a Hadoop cluster with eight nodes, and via K-Means in a single-machine environment, respectively. We then plot the results in Figure 12(a).

The quality of the algorithm is evaluated via the following error rate (ER) [51] equation:

ER = (O_m / O_t) × 100, (8)

Figure 11: Efficiency comparison on different data sets, plotting execution time (s) against data set size (80–2560 MB) for K-Means, ParCLARA, Par2PK-Means, and Par3PKM: (a) Taxi Trajectory, (b) Iris, (c) Haberman's Survival, (d) Ecoli, (e) Hayes-Roth, (f) Lenses, and (g) Wine.

Figure 12: Accuracy and reliability: (a) accuracy comparison (error rate, %) of Par3PKM, Par2PK-Means, ParCLARA, and K-Means on different data sets and (b) reliability of Par3PKM (execution time in seconds versus number of faulty nodes, 0–7).

where O_m is the number of misclassified objects and O_t is the total number of objects. The lower the ER, the better the clustering.
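As a hedged illustration (not the authors' code), the ER metric in (8) can be computed from cluster assignments and ground-truth labels. Mapping each cluster to its majority ground-truth label, as done below, is one common convention and is our assumption here:

```python
from collections import Counter


def error_rate(predicted, truth):
    """Error rate of Eq. (8): ER = (O_m / O_t) * 100.

    Each cluster is mapped to its majority ground-truth label (an
    illustrative convention, not necessarily the paper's); objects in a
    cluster that do not carry that label count as misclassified.
    """
    clusters = {}
    for cluster_id, label in zip(predicted, truth):
        clusters.setdefault(cluster_id, []).append(label)
    misclassified = sum(
        len(labels) - Counter(labels).most_common(1)[0][1]
        for labels in clusters.values()
    )
    return misclassified / len(truth) * 100


# Two clusters over five objects, one object misassigned -> ER = 20%.
print(error_rate([0, 0, 0, 1, 1], ["a", "a", "b", "b", "b"]))  # 20.0
```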

As illustrated in Figure 12(a), in comparison with the other algorithms, the Par3PKM algorithm produces more accurate clustering results in most cases. The results indicate that the Par3PKM algorithm is valid and feasible.

6.4. Evaluation on Speedup. In order to evaluate the speedup of the Par3PKM algorithm, we keep seven data sets constant (1280 MB) and increase the number of nodes (ranging from 1 node to 8 nodes) on a Hadoop cluster, and then plot the results in Figure 13(a). Moreover, we utilize the Iris data sets (1280 MB) to further verify the speedup of Par3PKM in comparison with Par2PK-Means and ParCLARA, and the results are illustrated in Figure 13(b).

The speedup metric [52, 53] is defined as

Speedup = T_s / T_p, (9)

where T_s represents the execution time of an algorithm on one node (i.e., the sequential execution time) for clustering

Figure 13: Speedup versus number of nodes (1–8) against the linear baseline: (a) speedup comparison of Par3PKM on different data sets and (b) speedup comparison of Par3PKM, Par2PK-Means, and ParCLARA.

objects using the given data sets, and T_p denotes the execution time of an algorithm for solving the same problem using the same data sets on a Hadoop cluster with p nodes (i.e., the parallel execution time), respectively.
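The definition in (9) can be tabulated directly from measured run times. The timings below are hypothetical placeholders, not values taken from the figures:

```python
def speedup(t_serial, t_parallel):
    """Speedup of Eq. (9): T_s / T_p, where T_s is the one-node run time
    and T_p is the run time on a cluster with p nodes."""
    return t_serial / t_parallel


# Hypothetical run times (seconds) on 1, 2, 4, and 8 nodes.
times = {1: 2400.0, 2: 1300.0, 4: 700.0, 8: 390.0}
for p, t_p in sorted(times.items()):
    # Speedup is measured against the one-node (sequential) time.
    print(p, round(speedup(times[1], t_p), 2))
```

A sub-linear result (e.g., speedup below p on p nodes) is expected in practice because of the communication cost the text mentions.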

As depicted in Figure 13(a), the speedup of the Par3PKM algorithm increases relatively linearly with an increasing number of nodes. It is known that linear speedup is difficult to achieve because of the communication cost and the skew of the slaves. Furthermore, Figure 13(b) shows that Par3PKM has better speedup than Par2PK-Means and ParCLARA. The results demonstrate that the parallel algorithm Par3PKM has very good speedup performance, which is almost the same for data sets with very different sizes.

6.5. Evaluation on Scale-Up. To evaluate how well the Par3PKM algorithm processes larger data sets when more nodes are available, we perform scale-up experiments in which we increase the size of the data sets (varying from 160 MB to 1280 MB) in direct proportion to the number of nodes (ranging from 1 node to 8 nodes), and then plot the results in Figure 14(a). Furthermore, with the Iris data sets (varying from 160 MB to 1280 MB), the scale-up comparison of Par3PKM, Par2PK-Means, and ParCLARA is depicted in Figure 14(b).

The scale-up metric [52, 53] is given by

Scale-up = T_s / T_ps, (10)

where T_s is the execution time of an algorithm for processing the given data sets on one node, and T_ps is the execution time of an algorithm for handling p-times larger data sets on p-times larger nodes.
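A minimal sketch of the scale-up computation in (10), under the T_ps reading above and with hypothetical timings (not values from the figures):

```python
def scale_up(t_one_node, t_scaled):
    """Scale-up of Eq. (10): T_s / T_ps. A value near 1 means the cluster
    handles p-times more data on p-times more nodes in roughly the same
    time; values drop below 1 as parallel overheads grow."""
    return t_one_node / t_scaled


# Hypothetical: 160 MB on 1 node took 380 s; 1280 MB on 8 nodes took 420 s.
print(round(scale_up(380.0, 420.0), 2))  # 0.9
```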

As illustrated in Figure 14(a), the scale-up values of Par3PKM are in the vicinity of 1, or even less, with the proportional growth of both the number of nodes and the size of the data sets. Moreover, Figure 14(b) shows that Par3PKM has better scale-up than Par2PK-Means and ParCLARA. The results indicate that the Par3PKM algorithm has excellent scale-up and adaptability on large-scale data sets on a Hadoop cluster with MapReduce.

6.6. Evaluation on Reliability. To evaluate the reliability of the Par3PKM algorithm, we shut down several nodes (ranging from 1 node to 7 nodes) to verify whether Par3PKM can execute normally and achieve the same clustering results on the Iris data sets (1280 MB), and then plot the results in Figure 12(b).

As illustrated in Figure 12(b), although the execution time of the Par3PKM algorithm increases gradually with the growth of the number of faulty nodes, the Par3PKM algorithm still executes normally and produces the same results. The results show that the Par3PKM algorithm has good reliability in a "Big Data" environment, owing to the high fault tolerance of the MapReduce framework on a Hadoop platform: when a node cannot execute tasks on a Hadoop cluster, the JobTracker automatically reassigns the tasks of the faulty node(s) to other spare nodes. In contrast, the serial K-Means algorithm on a single machine cannot execute the tasks normally when the machine is faulty, and the entire computational task fails.

In summary, extensive experiments were conducted on real and synthetic data, and the performance results demonstrate that the proposed Par3PKM algorithm is much more efficient and accurate, with better speedup, scale-up, and reliability.


Figure 14: Scale-up versus number of nodes (1–8): (a) scale-up comparison of Par3PKM on different data sets and (b) scale-up comparison of Par3PKM, Par2PK-Means, and ParCLARA.

7. Conclusions

In this paper, we have proposed an efficient MapReduce-based parallel clustering algorithm, named Par3PKM, to solve the traffic subarea division problem with large-scale taxi trajectories. In Par3PKM, the distance metric and initialization strategy of K-Means are optimized in order to enhance the accuracy of clustering. Then, to improve the efficiency and scalability of Par3PKM, the optimized K-Means algorithm is implemented in a MapReduce framework on Hadoop. The optimization and parallelism of Par3PKM save memory consumption and reduce the computational cost of big calculations, thereby significantly improving the accuracy, efficiency, scalability, and reliability of traffic subarea division. Our performance evaluation indicates that the proposed algorithm can efficiently cluster a large number of GPS trajectories of taxicabs and, in particular, achieves more accurate results than K-Means, Par2PK-Means, and ParCLARA, with favorably superior performance. Furthermore, based on Par3PKM, we have presented a distributed traffic subarea division method, named DTSAD, which is performed on a Hadoop distributed computing platform with the MapReduce parallel processing paradigm. In DTSAD, the boundary identifying method can effectively connect the borders of clustering results. Most importantly, we have divided the traffic subareas of Beijing using big real-world taxi trajectory data sets through the presented method, and our case study demonstrates that our approach can accurately and efficiently divide traffic subareas.
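The MapReduce parallelization of K-Means that Par3PKM builds on can be sketched as a single clustering iteration expressed as map and reduce steps. This is a simplified single-process outline under our own naming, not the authors' implementation, and it omits the paper's distance-metric and initialization optimizations:

```python
import math
from collections import defaultdict


def mapper(point, centers):
    """Map: assign one point to its nearest center; emit (center_id, (point, 1))."""
    cid = min(range(len(centers)),
              key=lambda i: math.dist(point, centers[i]))
    return cid, (point, 1)


def reducer(cid, values):
    """Reduce: average all points keyed to one center to get the new center."""
    dim = len(values[0][0])
    total = [0.0] * dim
    count = 0
    for point, n in values:
        count += n
        for d in range(dim):
            total[d] += point[d]
    return cid, tuple(t / count for t in total)


def kmeans_iteration(points, centers):
    """One driver round: group map outputs by key (the shuffle), then reduce.
    Empty clusters keep their previous center."""
    groups = defaultdict(list)
    for p in points:
        cid, value = mapper(p, centers)
        groups[cid].append(value)
    updated = dict(reducer(cid, vals) for cid, vals in groups.items())
    return [updated.get(i, c) for i, c in enumerate(centers)]


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
print(kmeans_iteration(points, [(0.0, 0.0), (10.0, 10.0)]))
# [(0.0, 0.5), (10.0, 10.5)]
```

In a real Hadoop job, the mapper would also run a combiner to pre-sum partial counts per center before the shuffle, which is what keeps the communication cost low as the data grows.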

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Authors' Contribution

Dawen Xia and Binfeng Wang contributed equally to this work.

Acknowledgments

The authors would like to thank the academic editor and the anonymous reviewers for their valuable comments and suggestions. This work was partially supported by the National Natural Science Foundation of China (Grant no. 61402380), the Scientific Project of the State Ethnic Affairs Commission of the People's Republic of China (Grant no. 14GZZ012), the Science and Technology Foundation of Guizhou (Grant no. LH20147386), and the Fundamental Research Funds for the Central Universities (Grants nos. XDJK2015B030 and XDJK2015D029).

References

[1] Y. Qi and S. Ishak, "Stochastic approach for short-term freeway traffic prediction during peak periods," IEEE Transactions on Intelligent Transportation Systems, vol. 14, no. 2, pp. 660–672, 2013.

[2] A. de Palma and R. Lindsey, "Traffic congestion pricing methodologies and technologies," Transportation Research Part C: Emerging Technologies, vol. 19, no. 6, pp. 1377–1399, 2011.

[3] V. Marx, "The big challenges of big data," Nature, vol. 498, no. 7453, pp. 255–260, 2013.

[4] D. Agrawal, P. Bernstein, E. Bertino et al., "Challenges and opportunities with big data: a community white paper developed by leading researchers across the United States," White Paper, 2012.


[5] "Special online collection: dealing with data," Science, vol. 331, no. 6018, pp. 639–806, 2011.

[6] "Big data: science in the petabyte era," Nature, vol. 455, no. 7209, pp. 1–136, 2008.

[7] R. R. Weiss and L. Zgorski, Obama Administration Unveils 'Big Data' Initiative: Announces $200 Million in New R&D Investments, Office of Science and Technology Policy, Executive Office of the President, 2012.

[8] J. Manyika, M. Chui, B. Brown et al., "Big data: the next frontier for innovation, competition, and productivity," Tech. Rep., McKinsey Global Institute, 2011.

[9] Y. Genovese and S. Prentice, "Pattern-based strategy: getting value from big data," Gartner Special Report, 2011.

[10] N. J. Yuan, Y. Zheng, L. Zhang, and X. Xie, "T-finder: a recommender system for finding passengers and vacant taxis," IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 10, pp. 2390–2403, 2013.

[11] J. Yuan, Y. Zheng, C. Zhang et al., "T-drive: driving directions based on taxi trajectories," in Proceedings of the 18th International Conference on Advances in Geographic Information Systems (ACM SIGSPATIAL GIS '10), pp. 99–108, ACM, San Jose, Calif, USA, November 2010.

[12] J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, Waltham, Mass, USA, 3rd edition, 2011.

[13] D. Pelleg and A. Moore, "X-means: extending K-means with efficient estimation of the number of clusters," in Proceedings of the 17th International Conference on Machine Learning (ICML '00), pp. 727–734, Stanford, Calif, USA, June 2000.

[14] S. Kantabutra and A. L. Couch, "Parallel K-means clustering algorithm on NOWs," NECTEC Technical Journal, vol. 1, no. 6, pp. 243–247, 2000.

[15] Y. Zhang, Z. Xiong, J. Mao, and L. Ou, "The study of parallel K-means algorithm," in Proceedings of the 6th World Congress on Intelligent Control and Automation (WCICA '06), pp. 5868–5871, IEEE, Dalian, China, June 2006.

[16] P. Kraj, A. Sharma, N. Garge, R. Podolsky, and R. A. McIndoe, "ParaKMeans: implementation of a parallelized K-means algorithm suitable for general laboratory use," BMC Bioinformatics, vol. 9, article 200, 13 pages, 2008.

[17] M. K. Pakhira, "Clustering large databases in distributed environment," in Proceedings of the IEEE International Advance Computing Conference (IACC '09), pp. 351–358, Patiala, India, March 2009.

[18] K. J. Kohlhoff, V. S. Pande, and R. B. Altman, "K-means for parallel architectures using all-prefix-sum sorting and updating steps," IEEE Transactions on Parallel and Distributed Systems, vol. 24, no. 8, pp. 1602–1612, 2013.

[19] C.-T. Chu, S. K. Kim, Y.-A. Lin et al., "Map-reduce for machine learning on multicore," in Proceedings of the 20th Annual Conference on Neural Information Processing Systems (NIPS '06), pp. 281–288, Vancouver, Canada, 2006.

[20] W. Zhao, H. Ma, and Q. He, "Parallel K-means clustering based on MapReduce," in Proceedings of the 1st International Conference on Cloud Computing (CloudCom '09), pp. 674–679, Beijing, China, December 2009.

[21] P. Zhou, J. Lei, and W. Ye, "Large-scale data sets clustering based on MapReduce and Hadoop," Journal of Computational Information Systems, vol. 7, no. 16, pp. 5956–5963, 2011.

[22] C. D. Nguyen, D. T. Nguyen, and V.-H. Pham, "Parallel two-phase K-means," in Proceedings of the 13th International Conference on Computational Science and Its Applications (ICCSA '13), pp. 24–27, Ho Chi Minh City, Vietnam, June 2013.

[23] R. J. Walinchus, "Real-time network decomposition and subnetwork interfacing," Tech. Rep. HS-011 999, Highway Research Record, 1971.

[24] S. C. Wong, W. T. Wong, C. M. Leung, and C. O. Tong, "Group-based optimization of a time-dependent TRANSYT traffic model for area traffic control," Transportation Research Part B: Methodological, vol. 36, no. 4, pp. 291–312, 2002.

[25] D. I. Robertson and R. D. Bretherton, "Optimizing networks of traffic signals in real time: the SCOOT method," IEEE Transactions on Vehicular Technology, vol. 40, no. 1, pp. 11–15, 1991.

[26] Y.-Y. Ma and X.-G. Yang, "Traffic sub-area division expert system for urban traffic control," in Proceedings of the International Conference on Intelligent Computation Technology and Automation (ICICTA '08), pp. 589–593, Hunan, China, October 2008.

[27] K. Lu, J.-M. Xu, S.-J. Zheng, and S.-M. Wang, "Research on fast dynamic division method of coordinated control subarea," Acta Automatica Sinica, vol. 38, no. 2, pp. 279–287, 2012.

[28] K. Lu, J.-M. Xu, and S.-J. Zheng, "Correlation degree analysis of neighboring intersections and its application," Journal of South China University of Technology, vol. 37, no. 11, pp. 37–42, 2009.

[29] H. Guo, J. Cheng, Q. Peng, C. Zhu, and Y. Mu, "Dynamic division of traffic control sub-area methods based on the similarity of adjacent intersections," in Proceedings of the IEEE 17th International Conference on Intelligent Transportation Systems (ITSC '14), pp. 2208–2213, Qingdao, China, October 2014.

[30] C. Li, Y. Xie, H. Zhang, and X.-L. Yan, "Dynamic division about traffic control sub-area based on back propagation neural network," in Proceedings of the 2nd International Conference on Intelligent Human-Machine Systems and Cybernetics (IHMSC '10), pp. 22–25, Nanjing, China, August 2010.

[31] Z. Zhou, S. Lin, and Y. Xi, "A fast network partition method for large-scale urban traffic networks," Journal of Control Theory and Applications, vol. 11, no. 3, pp. 359–366, 2013.

[32] J. MacQueen, "Some methods for classification and analysis of multivariate observations," in Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, pp. 281–297, University of California Press, Berkeley, Calif, USA, 1967.

[33] A. K. Jain, "Data clustering: 50 years beyond K-means," Pattern Recognition Letters, vol. 31, no. 8, pp. 651–666, 2010.

[34] X. Wu, V. Kumar, J. R. Quinlan et al., "Top 10 algorithms in data mining," Knowledge and Information Systems, vol. 14, no. 1, pp. 1–37, 2008.

[35] W. Zou, Y. Zhu, H. Chen, and X. Sui, "A clustering approach using cooperative artificial bee colony algorithm," Discrete Dynamics in Nature and Society, vol. 2010, Article ID 459796, 16 pages, 2010.

[36] D. T. Pham, S. S. Dimov, and C. D. Nguyen, "An incremental K-means algorithm," Proceedings of the Institution of Mechanical Engineers, Part C: Journal of Mechanical Engineering Science, vol. 218, no. 7, pp. 783–794, 2004.

[37] D. T. Pham, S. S. Dimov, and C. D. Nguyen, "A two-phase K-means algorithm for large datasets," Journal of Mechanical Engineering Science, vol. 218, no. 10, pp. 1269–1273, 2004.

[38] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, "The Hadoop distributed file system," in Proceedings of the 26th Symposium on Mass Storage Systems and Technologies (MSST '10), pp. 1–10, IEEE, Incline Village, Nev, USA, May 2010.

[39] A. Ene, S. Im, and B. Moseley, "Fast clustering using MapReduce," in Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '11), pp. 681–689, San Diego, Calif, USA, August 2011.

[40] T. White, Hadoop: The Definitive Guide, O'Reilly Media, Sebastopol, Calif, USA, 3rd edition, 2012.

[41] D. Xia, Z. Rong, Y. Zhou, Y. Li, Y. Shen, and Z. Zhang, "A novel parallel algorithm for frequent itemsets mining in massive small files datasets," ICIC Express Letters, Part B: Applications, vol. 5, no. 2, pp. 459–466, 2014.

[42] D. Xia, Z. Rong, Y. Zhou, B. Wang, Y. Li, and Z. Zhang, "Discovery and analysis of usage data based on Hadoop for personalized information access," in Proceedings of the IEEE 16th International Conference on Computational Science and Engineering—Big Data Science and Engineering (CSE-BDSE '13), pp. 917–924, IEEE, Sydney, Australia, December 2013.

[43] D. Xia, B. Wang, Z. Rong, Y. Li, and Z. Zhang, "Effective methods and strategies for massive small files processing based on Hadoop," ICIC Express Letters, vol. 8, no. 7, pp. 1935–1941, 2014.

[44] S. Ghemawat, H. Gobioff, and S.-T. Leung, "The Google file system," in Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP '03), pp. 29–43, Bolton Landing, NY, USA, October 2003.

[45] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.

[46] P. Zikopoulos, C. Eaton, D. deRoos, T. Deutsch, and G. Lapis, Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data, McGraw-Hill, New York, NY, USA, 2011.

[47] W. K. D. Pun and A. B. M. S. Ali, "Unique distance measure approach for K-means (UDMA-Km) clustering algorithm," in Proceedings of the IEEE Region 10 Conference (TENCON '07), pp. 1–4, IEEE, Taipei, Taiwan, November 2007.

[48] A. M. Fahim, A. M. Salem, F. A. Torkey, and M. A. Ramadan, "An efficient enhanced K-means clustering algorithm," Journal of Zhejiang University SCIENCE A, vol. 7, no. 10, pp. 1626–1633, 2006.

[49] M. Zhu, Data Mining, University of Science and Technology of China Press, 2002.

[50] L. Kaufman and P. J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, John Wiley & Sons, 1990.

[51] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proceedings of the 26th Annual Conference on Neural Information Processing Systems (NIPS '12), pp. 1097–1105, Lake Tahoe, Nev, USA, December 2012.

[52] S. Englert, J. Gray, T. Kocher, and P. Shah, "A benchmark of NonStop SQL release 2 demonstrating near-linear speedup and scaleup on large databases," ACM SIGMETRICS Performance Evaluation Review, vol. 18, no. 1, pp. 245–246, 1990.

[53] X. Xu, J. Jäger, and H.-P. Kriegel, "A fast parallel clustering algorithm for large spatial databases," Data Mining and Knowledge Discovery, vol. 3, no. 3, pp. 263–290, 1999.

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 13: Research Article An Efficient MapReduce-Based Parallel ...downloads.hindawi.com/journals/ddns/2015/793010.pdf · An Efficient MapReduce-Based Parallel Clustering Algorithm ... distribution,

Discrete Dynamics in Nature and Society 13

0

80 160 320 640 1280 2560

500

1000

1500

2000

2500

3000

Exec

utio

n tim

e (s)

Size of data sets (MB)

K-MeansParCLARA

Par2PK-MeansPar3PKM

(a)

0

500

1000

1500

2000

2500

Exec

utio

n tim

e (s)

80 160 320 640 1280 2560

Size of data sets (MB)

ParCLARAPar2PK-MeansPar3PKM

K-Means

(b)

0

500

1000

1500

2000

2500

Exec

utio

n tim

e (s)

80 160 320 640 1280 2560

Size of data sets (MB)

ParCLARAPar2PK-MeansPar3PKM

K-Means

(c)

0

500

1000

1500

2000

2500

Exec

utio

n tim

e (s)

80 160 320 640 1280 2560

Size of data sets (MB)

ParCLARAPar2PK-MeansPar3PKM

K-Means

(d)

0

500

1000

1500

2000

2500

Exec

utio

n tim

e (s)

80 160 320 640 1280 2560

Size of data sets (MB)

ParCLARAPar2PK-MeansPar3PKM

K-Means

(e)

Figure 11 Continued

14 Discrete Dynamics in Nature and Society

0

500

1000

1500

2000

2500

Exec

utio

n tim

e (s)

80 160 320 640 1280 2560

Size of data sets (MB)

ParCLARAPar2PK-MeansPar3PKM

K-Means

(f)

0

500

1000

1500

2000

2500

Exec

utio

n tim

e (s)

80 160 320 640 1280 2560

Size of data sets (MB)

ParCLARAPar2PK-MeansPar3PKM

K-Means

(g)

Figure 11 Efficiency comparison on different data sets (a) Taxi Trajectory (b) Iris (c) Habermanrsquos Survival (d) Ecoli (e) Hayes-Roth (f)Lenses and (g) Wine

[Figure 12: Accuracy and reliability. (a) Accuracy comparison (error rate, %) of Par3PKM, Par2PK-Means, ParCLARA, and K-Means on the Iris, Haberman's Survival, Ecoli, Lenses, Hayes-Roth, and Wine data sets; (b) reliability of Par3PKM: execution time (s) against the number of faulty nodes (0–7).]

where O_m is the number of misclassified objects and O_t is the total number of objects. The lower the ER, the better the clustering.
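The ER metric is simple enough to sketch directly; the function below and the sample numbers are illustrative, not values from the paper's experiments:

```python
def error_rate(misclassified, total):
    """ER = O_m / O_t, expressed as a percentage; lower values mean better clustering."""
    if total <= 0:
        raise ValueError("total number of objects must be positive")
    return 100.0 * misclassified / total

# e.g. 12 misclassified objects out of 150
print(error_rate(12, 150))  # -> 8.0
```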

As illustrated in Figure 12(a), in comparison with the other algorithms, the Par3PKM algorithm produces more accurate clustering results in most cases. The results indicate that the Par3PKM algorithm is valid and feasible.

6.4. Evaluation on Speedup. In order to evaluate the speedup of the Par3PKM algorithm, we keep the size of each of the seven data sets constant (1280 MB) and increase the number of nodes (ranging from 1 to 8) on a Hadoop cluster, and then plot the results in Figure 13(a). Moreover, we utilize the Iris data sets (1280 MB) to further verify the speedup of Par3PKM in comparison with Par2PK-Means and ParCLARA; the results are illustrated in Figure 13(b).

The speedup metric [52, 53] is defined as

Speedup = T_s / T_p, (9)

where T_s represents the execution time of an algorithm on one node (i.e., the sequential execution time) for clustering objects using the given data sets, and T_p denotes the execution time of an algorithm for solving the same problem using the same data sets on a Hadoop cluster with p nodes (i.e., the parallel execution time).

[Figure 13: Speedup. (a) Speedup comparison of Par3PKM on different data sets (Taxi Trajectory, Iris, Haberman's Survival, Ecoli, Hayes-Roth, Lenses, and Wine) against linear speedup; (b) speedup comparison of Par3PKM, Par2PK-Means, and ParCLARA against linear speedup, for 1–8 nodes.]
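Given measured execution times, a speedup curve like the one in Figure 13 reduces to a one-line computation; the timings below are hypothetical placeholders, not measurements from the paper:

```python
def speedup(t_serial, t_parallel):
    """Speedup = T_s / T_p (Eq. (9)); ideal linear speedup on p nodes equals p."""
    return t_serial / t_parallel

# hypothetical timings in seconds for the same job on 1, 2, 4, and 8 nodes
timings = {1: 2400.0, 2: 1300.0, 4: 700.0, 8: 440.0}
for p, t_p in sorted(timings.items()):
    print(p, round(speedup(timings[1], t_p), 2))
```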

As depicted in Figure 13(a), the speedup of the Par3PKM algorithm increases relatively linearly with an increasing number of nodes. It is known that linear speedup is difficult to achieve because of the communication cost and the skew of the slaves. Furthermore, Figure 13(b) shows that Par3PKM has better speedup than Par2PK-Means and ParCLARA. The results demonstrate that the parallel algorithm Par3PKM has very good speedup performance, which is almost the same for data sets with very different sizes.

6.5. Evaluation on Scale-Up. To evaluate how well the Par3PKM algorithm processes larger data sets when more nodes are available, we perform scale-up experiments where we increase the size of the data sets (varying from 160 MB to 1280 MB) in direct proportion to the number of nodes (ranging from 1 to 8), and then plot the results in Figure 14(a). Furthermore, with the Iris data sets (varying from 160 MB to 1280 MB), the scale-up comparison of Par3PKM, Par2PK-Means, and ParCLARA is depicted in Figure 14(b).

The scale-up metric [52, 53] is given by

Scale-up = T_s / T_p, (10)

where T_s is the execution time of an algorithm for processing the given data sets on one node, and T_p is the execution time of an algorithm for handling p-times larger data sets on p-times more nodes.
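Equation (10) can likewise be evaluated directly from measured times; the sample values below are invented for illustration:

```python
def scale_up(t_base, t_scaled):
    """Scale-up = T_s / T_p (Eq. (10)): T_s is the time for the base data set on
    one node, T_p the time for a p-times larger data set on p-times more nodes.
    Values close to 1 indicate near-ideal scalability."""
    return t_base / t_scaled

# hypothetical: 160 MB on 1 node took 300 s; 1280 MB on 8 nodes took 330 s
print(round(scale_up(300.0, 330.0), 2))  # -> 0.91
```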

As illustrated in Figure 14(a), the scale-up values of Par3PKM stay in the vicinity of 1, or even below it, as both the number of nodes and the size of the data sets grow proportionally. Moreover, Figure 14(b) shows that Par3PKM has better scale-up than Par2PK-Means and ParCLARA. The results indicate that the Par3PKM algorithm has excellent scale-up and adaptability on large-scale data sets based on a Hadoop cluster with MapReduce.

6.6. Evaluation on Reliability. To evaluate the reliability of the Par3PKM algorithm, we shut down several nodes (ranging from 1 to 7) to test whether Par3PKM still executes normally and achieves the same clustering results on the Iris data sets (1280 MB), and then plot the results in Figure 12(b).

As illustrated in Figure 12(b), although the execution time of the Par3PKM algorithm increases gradually with the number of faulty nodes, the algorithm still executes normally and produces the same results. This shows that the Par3PKM algorithm has good reliability in a "Big Data" environment, owing to the high fault tolerance of the MapReduce framework on a Hadoop platform: when a node cannot execute tasks on a Hadoop cluster, the JobTracker automatically reassigns the tasks of the faulty node(s) to other spare nodes. In contrast, the serial K-Means algorithm on a single machine cannot continue when that machine fails, so the entire computational task fails.
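The reassignment behaviour described above can be mimicked with a toy scheduler. This is a single-process illustration of the idea only (node and task names are made up), not Hadoop's actual JobTracker logic:

```python
def assign_tasks(tasks, workers, failed):
    """Round-robin tasks over the remaining healthy nodes, mimicking how a
    MapReduce master re-executes the work of failed workers elsewhere."""
    healthy = [w for w in workers if w not in failed]
    if not healthy:
        raise RuntimeError("no healthy nodes left; the job fails")
    return {task: healthy[i % len(healthy)] for i, task in enumerate(tasks)}

workers = ["node1", "node2", "node3", "node4"]
# with node2 and node3 down, all map tasks land on node1 and node4
print(assign_tasks(["map0", "map1", "map2"], workers, failed={"node2", "node3"}))
```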

In summary, extensive experiments are conducted on real and synthetic data, and the performance results demonstrate that the proposed Par3PKM algorithm is much more efficient and accurate, with better speedup, scale-up, and reliability.

[Figure 14: Scale-up. (a) Scale-up comparison of Par3PKM on different data sets (Taxi Trajectory, Iris, Haberman's Survival, Ecoli, Hayes-Roth, Lenses, and Wine); (b) scale-up comparison of Par3PKM, Par2PK-Means, and ParCLARA, for 1–8 nodes.]

7. Conclusions

In this paper, we have proposed an efficient MapReduce-based parallel clustering algorithm, named Par3PKM, to solve the traffic subarea division problem with large-scale taxi trajectories. In Par3PKM, the distance metric and initialization strategy of K-Means are optimized in order to enhance the accuracy of clustering. Then, to improve the efficiency and scalability of Par3PKM, the optimized K-Means algorithm is implemented in a MapReduce framework on Hadoop. The optimization and parallelism of Par3PKM save memory consumption and reduce the computational cost of big calculations, thereby significantly improving the accuracy, efficiency, scalability, and reliability of traffic subarea division. Our performance evaluation indicates that the proposed algorithm can efficiently cluster a large number of GPS trajectories of taxicabs and, in particular, achieves more accurate results than K-Means, Par2PK-Means, and ParCLARA, with favorably superior performance. Furthermore, based on Par3PKM, we have presented a distributed traffic subarea division method, named DTSAD, which is performed on a Hadoop distributed computing platform with the MapReduce parallel processing paradigm. In DTSAD, the boundary identifying method can effectively connect the borders of clustering results. Most importantly, we have divided the traffic subareas of Beijing using big real-world taxi trajectory data sets through the presented method, and our case study demonstrates that our approach can accurately and efficiently divide traffic subareas.
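To make the map/reduce decomposition behind a parallel K-Means concrete, here is a single-process sketch of one iteration in the MapReduce style (map: assign each point to its nearest centroid; combine: accumulate partial sums and counts; reduce: recompute each centroid as a mean). It uses plain Euclidean distance and is only an illustration, not the authors' Par3PKM implementation, which additionally modifies the distance metric and the initialization strategy:

```python
import math
from collections import defaultdict

def kmeans_iteration(points, centroids):
    dim = len(centroids[0])
    sums = defaultdict(lambda: [0.0] * dim)   # combiner state: partial coordinate sums
    counts = defaultdict(int)                 # combiner state: points per centroid
    for p in points:                          # map phase: emit (centroid id, point)
        j = min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
        counts[j] += 1
        for d, x in enumerate(p):
            sums[j][d] += x
    # reduce phase: new centroid = mean of its assigned points (keep empty clusters)
    return [tuple(s / counts[j] for s in sums[j]) if counts[j] else centroids[j]
            for j in range(len(centroids))]

pts = [(0.0, 0.0), (0.5, 0.0), (9.0, 9.0), (10.0, 10.0)]
print(kmeans_iteration(pts, [(0.0, 0.0), (9.0, 9.0)]))  # -> [(0.25, 0.0), (9.5, 9.5)]
```

In an actual Hadoop job, the loop body would run in parallel mappers over input splits, the per-centroid sums would be emitted by combiners, and a reducer per centroid would perform the final division.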

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Authors' Contribution

Dawen Xia and Binfeng Wang contributed equally to this work.

Acknowledgments

The authors would like to thank the academic editor and the anonymous reviewers for their valuable comments and suggestions. This work was partially supported by the National Natural Science Foundation of China (Grant no. 61402380), the Scientific Project of the State Ethnic Affairs Commission of the People's Republic of China (Grant no. 14GZZ012), the Science and Technology Foundation of Guizhou (Grant no. LH20147386), and the Fundamental Research Funds for the Central Universities (Grants nos. XDJK2015B030 and XDJK2015D029).

References

[1] Y. Qi and S. Ishak, "Stochastic approach for short-term freeway traffic prediction during peak periods," IEEE Transactions on Intelligent Transportation Systems, vol. 14, no. 2, pp. 660–672, 2013.

[2] A. de Palma and R. Lindsey, "Traffic congestion pricing methodologies and technologies," Transportation Research Part C: Emerging Technologies, vol. 19, no. 6, pp. 1377–1399, 2011.

[3] V. Marx, "The big challenges of big data," Nature, vol. 498, no. 7453, pp. 255–260, 2013.

[4] D. Agrawal, P. Bernstein, E. Bertino et al., "Challenges and opportunities with big data: a community white paper developed by leading researchers across the United States," White Paper, 2012.

[5] "Special online collection: dealing with data," Science, vol. 331, no. 6018, pp. 639–806, 2011.

[6] "Big data: science in the petabyte era," Nature, vol. 455, no. 7209, pp. 1–136, 2008.

[7] R. R. Weiss and L. Zgorski, Obama Administration Unveils 'Big Data' Initiative: Announces $200 Million in New R&D Investments, Office of Science and Technology Policy, Executive Office of the President, 2012.

[8] J. Manyika, M. Chui, B. Brown et al., "Big data: the next frontier for innovation, competition, and productivity," Tech. Rep., McKinsey Global Institute, 2011.

[9] Y. Genovese and S. Prentice, "Pattern-based strategy: getting value from big data," Gartner Special Report, 2011.

[10] N. J. Yuan, Y. Zheng, L. Zhang, and X. Xie, "T-finder: a recommender system for finding passengers and vacant taxis," IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 10, pp. 2390–2403, 2013.

[11] J. Yuan, Y. Zheng, C. Zhang et al., "T-drive: driving directions based on taxi trajectories," in Proceedings of the 18th International Conference on Advances in Geographic Information Systems (ACM SIGSPATIAL GIS '10), pp. 99–108, ACM, San Jose, Calif, USA, November 2010.

[12] J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, Waltham, Mass, USA, 3rd edition, 2011.

[13] D. Pelleg and A. Moore, "X-means: extending K-means with efficient estimation of the number of clusters," in Proceedings of the 17th International Conference on Machine Learning (ICML '00), pp. 727–734, Stanford, Calif, USA, June 2000.

[14] S. Kantabutra and A. L. Couch, "Parallel K-means clustering algorithm on NOWs," NECTEC Technical Journal, vol. 1, no. 6, pp. 243–247, 2000.

[15] Y. Zhang, Z. Xiong, J. Mao, and L. Ou, "The study of parallel K-means algorithm," in Proceedings of the 6th World Congress on Intelligent Control and Automation (WCICA '06), pp. 5868–5871, IEEE, Dalian, China, June 2006.

[16] P. Kraj, A. Sharma, N. Garge, R. Podolsky, and R. A. McIndoe, "ParaKMeans: implementation of a parallelized K-means algorithm suitable for general laboratory use," BMC Bioinformatics, vol. 9, article 200, 13 pages, 2008.

[17] M. K. Pakhira, "Clustering large databases in distributed environment," in Proceedings of the IEEE International Advance Computing Conference (IACC '09), pp. 351–358, Patiala, India, March 2009.

[18] K. J. Kohlhoff, V. S. Pande, and R. B. Altman, "K-means for parallel architectures using all-prefix-sum sorting and updating steps," IEEE Transactions on Parallel and Distributed Systems, vol. 24, no. 8, pp. 1602–1612, 2013.

[19] C.-T. Chu, S. K. Kim, Y.-A. Lin et al., "Map-reduce for machine learning on multicore," in Proceedings of the 20th Annual Conference on Neural Information Processing Systems (NIPS '06), pp. 281–288, Vancouver, Canada, 2006.

[20] W. Zhao, H. Ma, and Q. He, "Parallel K-means clustering based on MapReduce," in Proceedings of the 1st International Conference on Cloud Computing (CloudCom '09), pp. 674–679, Beijing, China, December 2009.

[21] P. Zhou, J. Lei, and W. Ye, "Large-scale data sets clustering based on MapReduce and Hadoop," Journal of Computational Information Systems, vol. 7, no. 16, pp. 5956–5963, 2011.

[22] C. D. Nguyen, D. T. Nguyen, and V.-H. Pham, "Parallel two-phase K-means," in Proceedings of the 13th International Conference on Computational Science and Its Applications (ICCSA '13), pp. 24–27, Ho Chi Minh City, Vietnam, June 2013.

[23] R. J. Walinchus, "Real-time network decomposition and subnetwork interfacing," Tech. Rep. HS-011 999, Highway Research Record, 1971.

[24] S. C. Wong, W. T. Wong, C. M. Leung, and C. O. Tong, "Group-based optimization of a time-dependent TRANSYT traffic model for area traffic control," Transportation Research Part B: Methodological, vol. 36, no. 4, pp. 291–312, 2002.

[25] D. I. Robertson and R. D. Bretherton, "Optimizing networks of traffic signals in real time: the SCOOT method," IEEE Transactions on Vehicular Technology, vol. 40, no. 1, pp. 11–15, 1991.

[26] Y.-Y. Ma and X.-G. Yang, "Traffic sub-area division expert system for urban traffic control," in Proceedings of the International Conference on Intelligent Computation Technology and Automation (ICICTA '08), pp. 589–593, Hunan, China, October 2008.

[27] K. Lu, J.-M. Xu, S.-J. Zheng, and S.-M. Wang, "Research on fast dynamic division method of coordinated control subarea," Acta Automatica Sinica, vol. 38, no. 2, pp. 279–287, 2012.

[28] K. Lu, J.-M. Xu, and S.-J. Zheng, "Correlation degree analysis of neighboring intersections and its application," Journal of South China University of Technology, vol. 37, no. 11, pp. 37–42, 2009.

[29] H. Guo, J. Cheng, Q. Peng, C. Zhu, and Y. Mu, "Dynamic division of traffic control sub-area methods based on the similarity of adjacent intersections," in Proceedings of the IEEE 17th International Conference on Intelligent Transportation Systems (ITSC '14), pp. 2208–2213, Qingdao, China, October 2014.

[30] C. Li, Y. Xie, H. Zhang, and X.-L. Yan, "Dynamic division about traffic control sub-area based on back propagation neural network," in Proceedings of the 2nd International Conference on Intelligent Human-Machine Systems and Cybernetics (IHMSC '10), pp. 22–25, Nanjing, China, August 2010.

[31] Z. Zhou, S. Lin, and Y. Xi, "A fast network partition method for large-scale urban traffic networks," Journal of Control Theory and Applications, vol. 11, no. 3, pp. 359–366, 2013.

[32] J. MacQueen, "Some methods for classification and analysis of multivariate observations," in Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, pp. 281–297, University of California Press, Berkeley, Calif, USA, 1967.

[33] A. K. Jain, "Data clustering: 50 years beyond K-means," Pattern Recognition Letters, vol. 31, no. 8, pp. 651–666, 2010.

[34] X. Wu, V. Kumar, J. R. Quinlan et al., "Top 10 algorithms in data mining," Knowledge and Information Systems, vol. 14, no. 1, pp. 1–37, 2008.

[35] W. Zou, Y. Zhu, H. Chen, and X. Sui, "A clustering approach using cooperative artificial bee colony algorithm," Discrete Dynamics in Nature and Society, vol. 2010, Article ID 459796, 16 pages, 2010.

[36] D. T. Pham, S. S. Dimov, and C. D. Nguyen, "An incremental K-means algorithm," Proceedings of the Institution of Mechanical Engineers, Part C: Journal of Mechanical Engineering Science, vol. 218, no. 7, pp. 783–794, 2004.

[37] D. T. Pham, S. S. Dimov, and C. D. Nguyen, "A two-phase K-means algorithm for large datasets," Journal of Mechanical Engineering Science, vol. 218, no. 10, pp. 1269–1273, 2004.

[38] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, "The Hadoop distributed file system," in Proceedings of the 26th Symposium on Mass Storage Systems and Technologies (MSST '10), pp. 1–10, IEEE, Incline Village, Nev, USA, May 2010.

[39] A. Ene, S. Im, and B. Moseley, "Fast clustering using MapReduce," in Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '11), pp. 681–689, San Diego, Calif, USA, August 2011.

[40] T. White, Hadoop: The Definitive Guide, O'Reilly Media, Sebastopol, Calif, USA, 3rd edition, 2012.

[41] D. Xia, Z. Rong, Y. Zhou, Y. Li, Y. Shen, and Z. Zhang, "A novel parallel algorithm for frequent itemsets mining in massive small files datasets," ICIC Express Letters, Part B: Applications, vol. 5, no. 2, pp. 459–466, 2014.

[42] D. Xia, Z. Rong, Y. Zhou, B. Wang, Y. Li, and Z. Zhang, "Discovery and analysis of usage data based on Hadoop for personalized information access," in Proceedings of the IEEE 16th International Conference on Computational Science and Engineering—Big Data Science and Engineering (CSE-BDSE '13), pp. 917–924, IEEE, Sydney, Australia, December 2013.

[43] D. Xia, B. Wang, Z. Rong, Y. Li, and Z. Zhang, "Effective methods and strategies for massive small files processing based on Hadoop," ICIC Express Letters, vol. 8, no. 7, pp. 1935–1941, 2014.

[44] S. Ghemawat, H. Gobioff, and S.-T. Leung, "The Google file system," in Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP '03), pp. 29–43, Bolton Landing, NY, USA, October 2003.

[45] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.

[46] P. Zikopoulos, C. Eaton, D. deRoos, T. Deutsch, and G. Lapis, Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data, McGraw-Hill, New York, NY, USA, 2011.

[47] W. K. D. Pun and A. B. M. S. Ali, "Unique distance measure approach for K-means (UDMA-Km) clustering algorithm," in Proceedings of the IEEE Region 10 Conference (TENCON '07), pp. 1–4, IEEE, Taipei, Taiwan, November 2007.

[48] A. M. Fahim, A. M. Salem, F. A. Torkey, and M. A. Ramadan, "An efficient enhanced K-means clustering algorithm," Journal of Zhejiang University SCIENCE A, vol. 7, no. 10, pp. 1626–1633, 2006.

[49] M. Zhu, Data Mining, University of Science and Technology of China Press, 2002.

[50] L. Kaufman and P. J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, John Wiley & Sons, 1990.

[51] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proceedings of the 26th Annual Conference on Neural Information Processing Systems (NIPS '12), pp. 1097–1105, Lake Tahoe, Nev, USA, December 2012.

[52] S. Englert, J. Gray, T. Kocher, and P. Shah, "A benchmark of NonStop SQL release 2 demonstrating near-linear speedup and scaleup on large databases," ACM SIGMETRICS Performance Evaluation Review, vol. 18, no. 1, pp. 245–246, 1990.

[53] X. Xu, J. Jäger, and H.-P. Kriegel, "A fast parallel clustering algorithm for large spatial databases," Data Mining and Knowledge Discovery, vol. 3, no. 3, pp. 263–290, 1999.



[31] Z Zhou S Lin and Y Xi ldquoA fast network partition methodfor large-scale urban traffic networksrdquo Journal of ControlTheoryand Applications vol 11 no 3 pp 359ndash366 2013

[32] J MacQueen ldquoSome methods for classification and analysis ofmultivariate observationsrdquo in Proceedings of the 5th BerkeleySymposium on Mathematical Statistics and Probability pp 281ndash297 University of California Press Berkeley Calif USA 1967

[33] A K Jain ldquoData clustering 50 years beyond K-meansrdquo PatternRecognition Letters vol 31 no 8 pp 651ndash666 2010

[34] XWu V Kumar J R Quinlan et al ldquoTop 10 algorithms in dataminingrdquo Knowledge and Information Systems vol 14 no 1 pp1ndash37 2008

[35] W Zou Y Zhu H Chen and X Sui ldquoA clustering approachusing cooperative artificial bee colony algorithmrdquo DiscreteDynamics in Nature and Society vol 2010 Article ID 45979616 pages 2010

[36] D T Pham S S Dimov and C D Nguyen ldquoAn incremental K-means algorithmrdquo Proceedings of the Institution of MechanicalEngineers Part C Journal ofMechanical Engineering Science vol218 no 7 pp 783ndash794 2004

[37] D T Pham S S Dimov and C D Nguyen ldquoA two-phaseK-means algorithm for large datasetsrdquo Journal of MechanicalEngineering Science vol 218 no 10 pp 1269ndash1273 2004

[38] K Shvachko H Kuang S Radia and R Chansler ldquoTheHadoopdistributed file systemrdquo in Proceedings of the 26th Symposium

18 Discrete Dynamics in Nature and Society

on Mass Storage Systems and Technologies (MSST rsquo10) pp 1ndash10IEEE Incline Village Nev USA May 2010

[39] A Ene S Im and B Moseley ldquoFast clustering using MapRe-ducerdquo in Proceedings of the 17th ACM SIGKDD InternationalConference on Knowledge Discovery and DataMining (KDD rsquo11)pp 681ndash689 San Diego Calif USA August 2011

[40] T White Hadoop The Definitive Guide OrsquoReilly MediaSebastopol Calif USA 3rd edition 2012

[41] D Xia Z Rong Y Zhou Y Li Y Shen and Z Zhang ldquoA novelparallel algorithm for frequent itemsetsmining inmassive smallfiles datasetsrdquo ICIC Express Letters Part B Applications vol 5no 2 pp 459ndash466 2014

[42] D Xia Z Rong Y Zhou B Wang Y Li and Z ZhangldquoDiscovery and analysis of usage data based on Hadoop forpersonalized information accessrdquo in Proceedings of the IEEE16th International Conference on Computational Science andEngineeringmdashBig Data Science and Engineering (CSE-BDSE rsquo13)pp 917ndash924 IEEE Sydney Australia December 2013

[43] D Xia B Wang Z Rong Y Li and Z Zhang ldquoEffectivemethods and strategies for massive small files processing basedon Hadooprdquo ICIC Express Letters vol 8 no 7 pp 1935ndash19412014

[44] S Ghemawat H Gobioff and S-T Leung ldquoThe google file sys-temrdquo in Proceedings of the 19th ACM Symposium on OperatingSystems Principles (SOSP rsquo03) pp 29ndash43 Bolton Landing NYUSA October 2003

[45] J Dean and SGhemawat ldquoMapReduce simplified data process-ing on large clustersrdquo Communications of the ACM vol 51 no1 pp 107ndash113 2008

[46] P Zikopoulos C Eaton D deRoos T Deutsch and G LapisUnderstanding Big Data Analytics for Enterprise Class Hadoopand Streaming Data McGraw-Hill New York NY USA 2011

[47] W K D Pun and A B M S Ali ldquoUnique distance measureapproach for K-means (UDMA-Km) clustering algorithmrdquo inProceedings of the IEEE Region 10 Conference (TENCON rsquo07)pp 1ndash4 IEEE Taipei Taiwan November 2007

[48] A M Fahim A M Salem F A Torkey and M A RamadanldquoAn efficient enhanced K-means clustering algorithmrdquo Journalof Zhejiang University SCIENCE A vol 7 no 10 pp 1626ndash16332006

[49] M Zhu Data Mining University of Science and Technology ofChina Press 2002

[50] L Kaufman and P J Rousseeuw Finding Groups in Data AnIntroduction to Cluster Analysis John Wiley amp Sons 1990

[51] A Krizhevsky I Sutskever andG EHinton ldquoImagenet classifi-cation with deep convolutional neural networksrdquo in Proceedingsof the 26th Annual Conference on Neural Information ProcessingSystems (NIPS rsquo12) pp 1097ndash1105 Lake Tahoe Nev USADecember 2012

[52] S Englert J Gray T Kocher and P Shah ldquoA benchmark ofNonStop SQL release 2 demonstrating near-linear speedup andscaleup on large databasesrdquo ACM SIGMETRICS PerformanceEvaluation Review vol 18 no 1 pp 245ndash246 1990

[53] X Xu J Jager and H-P Kriegel ldquoA fast parallel clustering algo-rithm for large spatial databasesrdquo Data Mining and KnowledgeDiscovery vol 3 no 3 pp 263ndash290 1999


Discrete Dynamics in Nature and Society 15

[Figure 13: Speedup. (a) Speedup comparison of Par3PKM on different data sets (Taxi Trajectory, Iris, Haberman's Survival, Ecoli, Hayes-Roth, Lenses, and Wine, plotted against linear speedup); (b) speedup comparison of Par3PKM, Par2PK-Means, and ParCLARA. Both panels plot speedup against the number of nodes (1–8).]

objects using the given data sets, and T_p denotes the execution time of an algorithm for solving the same problem using the same data sets on a Hadoop cluster with p nodes (i.e., the parallel execution time), respectively.

As depicted in Figure 13(a), the speedup of the Par3PKM algorithm increases roughly linearly with the number of nodes. Linear speedup is difficult to achieve in practice because of communication costs and load skew among the slave nodes. Furthermore, Figure 13(b) shows that Par3PKM achieves better speedup than Par2PK-Means and ParCLARA. These results demonstrate that the parallel Par3PKM algorithm has very good speedup performance, which remains almost the same for data sets of very different sizes.
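As a concrete illustration of the speedup metric, the ratio T_s / T_p can be computed directly from measured execution times. The sketch below uses made-up placeholder timings, not the paper's measurements:

```python
# Illustrative sketch of the speedup metric: Speedup = T_s / T_p, where T_s
# is the single-node execution time and T_p the time on a p-node cluster.
# The timing values below are placeholders, not measured results.
def speedup(t_s: float, t_p: float) -> float:
    return t_s / t_p

# nodes -> execution time in seconds (illustrative only)
times = {1: 800.0, 2: 430.0, 4: 240.0, 8: 150.0}
for p in sorted(times):
    s = speedup(times[1], times[p])
    print(f"{p} node(s): speedup = {s:.2f} (linear ideal: {p})")
```

Communication overhead and load skew are why the computed ratios fall below the linear ideal p as the cluster grows.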

6.5. Evaluation on Scale-Up. To evaluate how well the Par3PKM algorithm processes larger data sets when more nodes are available, we perform scale-up experiments in which we increase the size of the data sets (from 160 MB to 1280 MB) in direct proportion to the number of nodes (from 1 to 8) and plot the results in Figure 14(a). Furthermore, with the Iris data sets (from 160 MB to 1280 MB), the scale-up comparison of Par3PKM, Par2PK-Means, and ParCLARA is depicted in Figure 14(b).

The scale-up metric [52, 53] is given by

	Scale-up = T_s / T_p,	(10)

where T_s is the execution time of an algorithm for processing the given data sets on one node and T_p is the execution time of an algorithm for handling p-times larger data sets on p-times larger nodes.

As illustrated in Figure 14(a), the scale-up values of Par3PKM stay in the vicinity of 1, or even below it, as the number of nodes and the size of the data sets grow proportionally. Moreover, Figure 14(b) shows that Par3PKM has better scale-up than Par2PK-Means and ParCLARA. The results indicate that the Par3PKM algorithm scales and adapts very well to large-scale data sets on a Hadoop cluster with MapReduce.
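The scale-up metric of (10) can likewise be checked numerically: with the data size grown in direct proportion to the node count, the ratio should stay near 1. A small sketch with illustrative (not measured) timings:

```python
# Illustrative sketch of the scale-up metric: Scale-up = T_s / T_p, where the
# data size grows in proportion to the number of nodes. Values near 1 mean
# the algorithm scales well. Timings are placeholders, not measured results.
def scale_up(t_one_node: float, t_p_nodes: float) -> float:
    return t_one_node / t_p_nodes

# (nodes, data size in MB, execution time in s); data size doubles with nodes
runs = [(1, 160, 300.0), (2, 320, 320.0), (4, 640, 345.0), (8, 1280, 400.0)]
t_s = runs[0][2]
for nodes, size_mb, t_p in runs:
    print(f"{nodes} node(s), {size_mb} MB: scale-up = {scale_up(t_s, t_p):.2f}")
```

A gradual drift below 1.0 in such a series reflects the growing coordination overhead of a larger cluster, which matches the behavior reported for Figure 14.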

6.6. Evaluation on Reliability. To evaluate the reliability of the Par3PKM algorithm, we shut down several nodes (from 1 to 7) to test whether Par3PKM can still execute normally and achieve the same clustering results on the 1280 MB Iris data sets, and we plot the results in Figure 12(b).

As illustrated in Figure 12(b), although the execution time of the Par3PKM algorithm increases gradually with the number of faulty nodes, the algorithm still executes normally and produces the same results. This shows that Par3PKM has good reliability in a "Big Data" environment, owing to the high fault tolerance of the MapReduce framework on a Hadoop platform: when a node cannot execute its tasks on a Hadoop cluster, the JobTracker automatically reassigns the tasks of the faulty node(s) to spare nodes. By contrast, the serial K-Means algorithm on a single machine cannot continue executing when that machine fails, and the entire computational task fails with it.

In summary, extensive experiments were conducted on real and synthetic data, and the performance results demonstrate that the proposed Par3PKM algorithm is much more efficient and accurate, with better speedup, scale-up, and reliability.


[Figure 14: Scale-up. (a) Scale-up comparison of Par3PKM on different data sets (Taxi Trajectory, Iris, Haberman's Survival, Ecoli, Hayes-Roth, Lenses, and Wine); (b) scale-up comparison of Par3PKM, Par2PK-Means, and ParCLARA. Both panels plot scale-up against the number of nodes (1–8).]

7. Conclusions

In this paper, we have proposed an efficient MapReduce-based parallel clustering algorithm, named Par3PKM, to solve the traffic subarea division problem with large-scale taxi trajectories. In Par3PKM, the distance metric and initialization strategy of K-Means are optimized in order to enhance the accuracy of clustering. Then, to improve the efficiency and scalability of Par3PKM, the optimized K-Means algorithm is implemented in a MapReduce framework on Hadoop. The optimization and parallelism of Par3PKM reduce memory consumption and the computational cost of large calculations, thereby significantly improving the accuracy, efficiency, scalability, and reliability of traffic subarea division. Our performance evaluation indicates that the proposed algorithm can efficiently cluster a large number of GPS trajectories of taxicabs and, in particular, achieves more accurate results than K-Means, Par2PK-Means, and ParCLARA, with favorably superior performance. Furthermore, based on Par3PKM, we have presented a distributed traffic subarea division method, named DTSAD, which is performed on a Hadoop distributed computing platform with the MapReduce parallel processing paradigm. In DTSAD, the boundary identifying method can effectively connect the borders of the clustering results. Most importantly, we have divided the traffic subareas of Beijing using big real-world taxi trajectory data sets through the presented method, and our case study demonstrates that our approach can accurately and efficiently divide traffic subareas.
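To make the map/reduce structure underlying a parallel K-Means concrete, the following is a minimal single-process sketch of one MapReduce-style K-Means iteration: the map step assigns each point to its nearest center, and the reduce step averages the accumulated points per center. This illustrates the general pattern only; it uses plain Euclidean distance on 2-D points and is not the authors' Par3PKM Hadoop implementation, optimized distance metric, or initialization strategy.

```python
import math
from collections import defaultdict

def kmeans_iteration(points, centers):
    """One MapReduce-style K-Means step: map points to their nearest center,
    then reduce the per-center sums into updated centers."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])  # center index -> [sum_x, sum_y, count]
    for x, y in points:  # "map" (with local combining): emit (nearest center, point)
        idx = min(range(len(centers)),
                  key=lambda i: math.dist((x, y), centers[i]))
        acc = sums[idx]
        acc[0] += x
        acc[1] += y
        acc[2] += 1
    new_centers = []  # "reduce": average each center's accumulated sum
    for i in range(len(centers)):
        sx, sy, n = sums[i]
        new_centers.append((sx / n, sy / n) if n else centers[i])
    return new_centers

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0)]
print(kmeans_iteration(points, [(0.0, 0.0), (5.0, 5.0)]))
```

In the distributed setting, the per-center partial sums play the role of the combiner output shipped from mappers to reducers, which is what keeps the shuffle traffic small relative to the input data.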

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Authorsrsquo Contribution

Dawen Xia and Binfeng Wang contributed equally to this work.

Acknowledgments

The authors would like to thank the academic editor and the anonymous reviewers for their valuable comments and suggestions. This work was partially supported by the National Natural Science Foundation of China (Grant no. 61402380), the Scientific Project of the State Ethnic Affairs Commission of the People's Republic of China (Grant no. 14GZZ012), the Science and Technology Foundation of Guizhou (Grant no. LH20147386), and the Fundamental Research Funds for the Central Universities (Grant nos. XDJK2015B030 and XDJK2015D029).

References

[1] Y. Qi and S. Ishak, "Stochastic approach for short-term freeway traffic prediction during peak periods," IEEE Transactions on Intelligent Transportation Systems, vol. 14, no. 2, pp. 660–672, 2013.

[2] A. de Palma and R. Lindsey, "Traffic congestion pricing methodologies and technologies," Transportation Research Part C: Emerging Technologies, vol. 19, no. 6, pp. 1377–1399, 2011.

[3] V. Marx, "The big challenges of big data," Nature, vol. 498, no. 7453, pp. 255–260, 2013.

[4] D. Agrawal, P. Bernstein, E. Bertino et al., "Challenges and opportunities with big data: a community white paper developed by leading researchers across the United States," White Paper, 2012.

[5] "Special online collection: dealing with data," Science, vol. 331, no. 6018, pp. 639–806, 2011.

[6] "Big data: science in the petabyte era," Nature, vol. 455, no. 7209, pp. 1–136, 2008.

[7] R. R. Weiss and L. Zgorski, Obama Administration Unveils 'Big Data' Initiative: Announces $200 Million in New R&D Investments, Office of Science and Technology Policy, Executive Office of the President, 2012.

[8] J. Manyika, M. Chui, B. Brown et al., "Big data: the next frontier for innovation, competition, and productivity," Tech. Rep., McKinsey Global Institute, 2011.

[9] Y. Genovese and S. Prentice, "Pattern-based strategy: getting value from big data," Gartner Special Report, 2011.

[10] N. J. Yuan, Y. Zheng, L. Zhang, and X. Xie, "T-finder: a recommender system for finding passengers and vacant taxis," IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 10, pp. 2390–2403, 2013.

[11] J. Yuan, Y. Zheng, C. Zhang et al., "T-drive: driving directions based on taxi trajectories," in Proceedings of the 18th International Conference on Advances in Geographic Information Systems (ACM SIGSPATIAL GIS '10), pp. 99–108, ACM, San Jose, Calif, USA, November 2010.

[12] J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, Waltham, Mass, USA, 3rd edition, 2011.

[13] D. Pelleg and A. Moore, "X-means: extending K-means with efficient estimation of the number of clusters," in Proceedings of the 17th International Conference on Machine Learning (ICML '00), pp. 727–734, Stanford, Calif, USA, June 2000.

[14] S. Kantabutra and A. L. Couch, "Parallel K-means clustering algorithm on NOWs," NECTEC Technical Journal, vol. 1, no. 6, pp. 243–247, 2000.

[15] Y. Zhang, Z. Xiong, J. Mao, and L. Ou, "The study of parallel K-means algorithm," in Proceedings of the 6th World Congress on Intelligent Control and Automation (WCICA '06), pp. 5868–5871, IEEE, Dalian, China, June 2006.

[16] P. Kraj, A. Sharma, N. Garge, R. Podolsky, and R. A. McIndoe, "ParaKMeans: implementation of a parallelized K-means algorithm suitable for general laboratory use," BMC Bioinformatics, vol. 9, article 200, 13 pages, 2008.

[17] M. K. Pakhira, "Clustering large databases in distributed environment," in Proceedings of the IEEE International Advance Computing Conference (IACC '09), pp. 351–358, Patiala, India, March 2009.

[18] K. J. Kohlhoff, V. S. Pande, and R. B. Altman, "K-means for parallel architectures using all-prefix-sum sorting and updating steps," IEEE Transactions on Parallel and Distributed Systems, vol. 24, no. 8, pp. 1602–1612, 2013.

[19] C.-T. Chu, S. K. Kim, Y.-A. Lin et al., "Map-reduce for machine learning on multicore," in Proceedings of the 20th Annual Conference on Neural Information Processing Systems (NIPS '06), pp. 281–288, Vancouver, Canada, 2006.

[20] W. Zhao, H. Ma, and Q. He, "Parallel K-means clustering based on MapReduce," in Proceedings of the 1st International Conference on Cloud Computing (CloudCom '09), pp. 674–679, Beijing, China, December 2009.

[21] P. Zhou, J. Lei, and W. Ye, "Large-scale data sets clustering based on MapReduce and Hadoop," Journal of Computational Information Systems, vol. 7, no. 16, pp. 5956–5963, 2011.

[22] C. D. Nguyen, D. T. Nguyen, and V.-H. Pham, "Parallel two-phase K-means," in Proceedings of the 13th International Conference on Computational Science and Its Applications (ICCSA '13), pp. 24–27, Ho Chi Minh City, Vietnam, June 2013.

[23] R. J. Walinchus, "Real-time network decomposition and subnetwork interfacing," Tech. Rep. HS-011 999, Highway Research Record, 1971.

[24] S. C. Wong, W. T. Wong, C. M. Leung, and C. O. Tong, "Group-based optimization of a time-dependent TRANSYT traffic model for area traffic control," Transportation Research Part B: Methodological, vol. 36, no. 4, pp. 291–312, 2002.

[25] D. I. Robertson and R. D. Bretherton, "Optimizing networks of traffic signals in real time: the SCOOT method," IEEE Transactions on Vehicular Technology, vol. 40, no. 1, pp. 11–15, 1991.

[26] Y.-Y. Ma and X.-G. Yang, "Traffic sub-area division expert system for urban traffic control," in Proceedings of the International Conference on Intelligent Computation Technology and Automation (ICICTA '08), pp. 589–593, Hunan, China, October 2008.

[27] K. Lu, J.-M. Xu, S.-J. Zheng, and S.-M. Wang, "Research on fast dynamic division method of coordinated control subarea," Acta Automatica Sinica, vol. 38, no. 2, pp. 279–287, 2012.

[28] K. Lu, J.-M. Xu, and S.-J. Zheng, "Correlation degree analysis of neighboring intersections and its application," Journal of South China University of Technology, vol. 37, no. 11, pp. 37–42, 2009.

[29] H. Guo, J. Cheng, Q. Peng, C. Zhu, and Y. Mu, "Dynamic division of traffic control sub-area methods based on the similarity of adjacent intersections," in Proceedings of the IEEE 17th International Conference on Intelligent Transportation Systems (ITSC '14), pp. 2208–2213, Qingdao, China, October 2014.

[30] C. Li, Y. Xie, H. Zhang, and X.-L. Yan, "Dynamic division about traffic control sub-area based on back propagation neural network," in Proceedings of the 2nd International Conference on Intelligent Human-Machine Systems and Cybernetics (IHMSC '10), pp. 22–25, Nanjing, China, August 2010.

[31] Z. Zhou, S. Lin, and Y. Xi, "A fast network partition method for large-scale urban traffic networks," Journal of Control Theory and Applications, vol. 11, no. 3, pp. 359–366, 2013.

[32] J. MacQueen, "Some methods for classification and analysis of multivariate observations," in Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, pp. 281–297, University of California Press, Berkeley, Calif, USA, 1967.

[33] A. K. Jain, "Data clustering: 50 years beyond K-means," Pattern Recognition Letters, vol. 31, no. 8, pp. 651–666, 2010.

[34] X. Wu, V. Kumar, J. R. Quinlan et al., "Top 10 algorithms in data mining," Knowledge and Information Systems, vol. 14, no. 1, pp. 1–37, 2008.

[35] W. Zou, Y. Zhu, H. Chen, and X. Sui, "A clustering approach using cooperative artificial bee colony algorithm," Discrete Dynamics in Nature and Society, vol. 2010, Article ID 459796, 16 pages, 2010.

[36] D. T. Pham, S. S. Dimov, and C. D. Nguyen, "An incremental K-means algorithm," Proceedings of the Institution of Mechanical Engineers, Part C: Journal of Mechanical Engineering Science, vol. 218, no. 7, pp. 783–794, 2004.

[37] D. T. Pham, S. S. Dimov, and C. D. Nguyen, "A two-phase K-means algorithm for large datasets," Journal of Mechanical Engineering Science, vol. 218, no. 10, pp. 1269–1273, 2004.

[38] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, "The Hadoop distributed file system," in Proceedings of the 26th Symposium on Mass Storage Systems and Technologies (MSST '10), pp. 1–10, IEEE, Incline Village, Nev, USA, May 2010.

[39] A. Ene, S. Im, and B. Moseley, "Fast clustering using MapReduce," in Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '11), pp. 681–689, San Diego, Calif, USA, August 2011.

[40] T. White, Hadoop: The Definitive Guide, O'Reilly Media, Sebastopol, Calif, USA, 3rd edition, 2012.

[41] D. Xia, Z. Rong, Y. Zhou, Y. Li, Y. Shen, and Z. Zhang, "A novel parallel algorithm for frequent itemsets mining in massive small files datasets," ICIC Express Letters, Part B: Applications, vol. 5, no. 2, pp. 459–466, 2014.

[42] D. Xia, Z. Rong, Y. Zhou, B. Wang, Y. Li, and Z. Zhang, "Discovery and analysis of usage data based on Hadoop for personalized information access," in Proceedings of the IEEE 16th International Conference on Computational Science and Engineering—Big Data Science and Engineering (CSE-BDSE '13), pp. 917–924, IEEE, Sydney, Australia, December 2013.

[43] D. Xia, B. Wang, Z. Rong, Y. Li, and Z. Zhang, "Effective methods and strategies for massive small files processing based on Hadoop," ICIC Express Letters, vol. 8, no. 7, pp. 1935–1941, 2014.

[44] S. Ghemawat, H. Gobioff, and S.-T. Leung, "The Google file system," in Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP '03), pp. 29–43, Bolton Landing, NY, USA, October 2003.

[45] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.

[46] P. Zikopoulos, C. Eaton, D. deRoos, T. Deutsch, and G. Lapis, Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data, McGraw-Hill, New York, NY, USA, 2011.

[47] W. K. D. Pun and A. B. M. S. Ali, "Unique distance measure approach for K-means (UDMA-Km) clustering algorithm," in Proceedings of the IEEE Region 10 Conference (TENCON '07), pp. 1–4, IEEE, Taipei, Taiwan, November 2007.

[48] A. M. Fahim, A. M. Salem, F. A. Torkey, and M. A. Ramadan, "An efficient enhanced K-means clustering algorithm," Journal of Zhejiang University SCIENCE A, vol. 7, no. 10, pp. 1626–1633, 2006.

[49] M. Zhu, Data Mining, University of Science and Technology of China Press, 2002.

[50] L. Kaufman and P. J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, John Wiley & Sons, 1990.

[51] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proceedings of the 26th Annual Conference on Neural Information Processing Systems (NIPS '12), pp. 1097–1105, Lake Tahoe, Nev, USA, December 2012.

[52] S. Englert, J. Gray, T. Kocher, and P. Shah, "A benchmark of NonStop SQL release 2 demonstrating near-linear speedup and scaleup on large databases," ACM SIGMETRICS Performance Evaluation Review, vol. 18, no. 1, pp. 245–246, 1990.

[53] X. Xu, J. Jager, and H.-P. Kriegel, "A fast parallel clustering algorithm for large spatial databases," Data Mining and Knowledge Discovery, vol. 3, no. 3, pp. 263–290, 1999.

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 16: Research Article An Efficient MapReduce-Based Parallel ...downloads.hindawi.com/journals/ddns/2015/793010.pdf · An Efficient MapReduce-Based Parallel Clustering Algorithm ... distribution,

16 Discrete Dynamics in Nature and Society

Number of nodes1 2 3 4 5 6 7 8

Taxi TrajectoryIrisHabermanrsquos SurvivalEcoli

Hayes-RothLensesWine

105

100

095

090

085

080

075

070

065

060

Scal

e-up

(a)

Number of nodes1 2 3 4 5 6 7 8

Par3PKMPar2PK-MeansParCLARA

105

100

095

090

085

080

075

070

065

060

055

050

Scal

e-up

(b)

Figure 14 Scale-up (a) Scale-up comparison of Par3PKM on different data sets and (b) scale-up comparison of Par3PKM Par2PK-Meansand ParCLARA

7 Conclusions

In this paper we have proposed an efficient MapReduce-based parallel clustering algorithm named Par3PKM tosolve traffic subarea division problem with large-scale taxitrajectories In Par3PKM the distance metric and initializa-tion strategy of 119870-Means are optimized in order to enhancethe accuracy of clustering Then to improve the efficiencyand scalability of Par3PKM the optimal119870-Means algorithmis implemented in a MapReduce framework on Hadoop Theoptimization and parallelism of Par3PKM save memory con-sumption and reduce the computational cost of big calcula-tions thereby significantly improving the accuracy efficiencyscalability and reliability of traffic subarea division Our per-formance evaluation indicates that the proposed algorithmcan efficiently cluster a large number of GPS trajectories oftaxicabs and especially achieves more accurate results than119870-Means Par2PK-Means and ParCLARA with favorablysuperior performance Furthermore based on Par3PKM wehave presented a distributed traffic subarea division methodnamedDTSAD which is performed on aHadoop distributedcomputing platform with the MapReduce parallel processingparadigm In DTSAD the boundary identifying method caneffectively connect the borders of clustering results Mostimportantly we have divided traffic subarea of Beijing usingbig real-world taxi trajectory data sets through the presentedmethod and our case study demonstrates that our approachcan accurately and efficiently divide traffic subarea

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Authorsrsquo Contribution

Dawen Xia and Binfeng Wang contributed equally to thiswork

Acknowledgments

The authors would like to thank the academic editor and theanonymous reviewers for their valuable comments and sug-gestions This work was partially supported by the NationalNatural Science Foundation of China (Grant no 61402380)the Scientific Project of State Ethnic Affairs Commissionof the Peoplersquos Republic of China (Grant no 14GZZ012)the Science and Technology Foundation of Guizhou (Grantno LH20147386) and the Fundamental Research Funds forthe Central Universities (Grants nos XDJK2015B030 andXDJK2015D029)

References

[1] Y Qi and S Ishak ldquoStochastic approach for short-term freewaytraffic prediction during peak periodsrdquo IEEE Transactions onIntelligent Transportation Systems vol 14 no 2 pp 660ndash6722013

[2] A de Palma and R Lindsey ldquoTraffic congestion pricingmethodologies and technologiesrdquo Transportation Research PartC Emerging Technologies vol 19 no 6 pp 1377ndash1399 2011

[3] V Marx ldquoThe big challenges of big datardquo Nature vol 498 no7453 pp 255ndash260 2013

[4] D Agrawal P Bernstein E Bertino et al ldquoChallenges andopportunities with big data a community white paper devel-oped by leading researchers across the United Statesrdquo WhitePaper 2012

Discrete Dynamics in Nature and Society 17

[5] "Special online collection: dealing with data," Science, vol. 331, no. 6018, pp. 639–806, 2011.

[6] "Big data: science in the petabyte era," Nature, vol. 455, no. 7209, pp. 1–136, 2008.

[7] R. R. Weiss and L. Zgorski, Obama Administration Unveils 'Big Data' Initiative: Announces $200 Million in New R&D Investments, Office of Science and Technology Policy, Executive Office of the President, 2012.

[8] J. Manyika, M. Chui, B. Brown et al., "Big data: the next frontier for innovation, competition, and productivity," Tech. Rep., McKinsey Global Institute, 2011.

[9] Y. Genovese and S. Prentice, "Pattern-based strategy: getting value from big data," Gartner Special Report, 2011.

[10] N. J. Yuan, Y. Zheng, L. Zhang, and X. Xie, "T-finder: a recommender system for finding passengers and vacant taxis," IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 10, pp. 2390–2403, 2013.

[11] J. Yuan, Y. Zheng, C. Zhang et al., "T-drive: driving directions based on taxi trajectories," in Proceedings of the 18th International Conference on Advances in Geographic Information Systems (ACM SIGSPATIAL GIS '10), pp. 99–108, ACM, San Jose, Calif, USA, November 2010.

[12] J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, Waltham, Mass, USA, 3rd edition, 2011.

[13] D. Pelleg and A. Moore, "X-means: extending K-means with efficient estimation of the number of clusters," in Proceedings of the 17th International Conference on Machine Learning (ICML '00), pp. 727–734, Stanford, Calif, USA, June 2000.

[14] S. Kantabutra and A. L. Couch, "Parallel K-means clustering algorithm on NOWs," NECTEC Technical Journal, vol. 1, no. 6, pp. 243–247, 2000.

[15] Y. Zhang, Z. Xiong, J. Mao, and L. Ou, "The study of parallel K-means algorithm," in Proceedings of the 6th World Congress on Intelligent Control and Automation (WCICA '06), pp. 5868–5871, IEEE, Dalian, China, June 2006.

[16] P. Kraj, A. Sharma, N. Garge, R. Podolsky, and R. A. McIndoe, "ParaKMeans: implementation of a parallelized K-means algorithm suitable for general laboratory use," BMC Bioinformatics, vol. 9, article 200, 13 pages, 2008.

[17] M. K. Pakhira, "Clustering large databases in distributed environment," in Proceedings of the IEEE International Advance Computing Conference (IACC '09), pp. 351–358, Patiala, India, March 2009.

[18] K. J. Kohlhoff, V. S. Pande, and R. B. Altman, "K-means for parallel architectures using all-prefix-sum sorting and updating steps," IEEE Transactions on Parallel and Distributed Systems, vol. 24, no. 8, pp. 1602–1612, 2013.

[19] C.-T. Chu, S. K. Kim, Y.-A. Lin et al., "Map-reduce for machine learning on multicore," in Proceedings of the 20th Annual Conference on Neural Information Processing Systems (NIPS '06), pp. 281–288, Vancouver, Canada, 2006.

[20] W. Zhao, H. Ma, and Q. He, "Parallel K-means clustering based on MapReduce," in Proceedings of the 1st International Conference on Cloud Computing (CloudCom '09), pp. 674–679, Beijing, China, December 2009.

[21] P. Zhou, J. Lei, and W. Ye, "Large-scale data sets clustering based on MapReduce and Hadoop," Journal of Computational Information Systems, vol. 7, no. 16, pp. 5956–5963, 2011.

[22] C. D. Nguyen, D. T. Nguyen, and V.-H. Pham, "Parallel two-phase K-means," in Proceedings of the 13th International Conference on Computational Science and Its Applications (ICCSA '13), pp. 24–27, Ho Chi Minh City, Vietnam, June 2013.

[23] R. J. Walinchus, "Real-time network decomposition and subnetwork interfacing," Tech. Rep. HS-011 999, Highway Research Record, 1971.

[24] S. C. Wong, W. T. Wong, C. M. Leung, and C. O. Tong, "Group-based optimization of a time-dependent TRANSYT traffic model for area traffic control," Transportation Research Part B: Methodological, vol. 36, no. 4, pp. 291–312, 2002.

[25] D. I. Robertson and R. D. Bretherton, "Optimizing networks of traffic signals in real time: the SCOOT method," IEEE Transactions on Vehicular Technology, vol. 40, no. 1, pp. 11–15, 1991.

[26] Y.-Y. Ma and X.-G. Yang, "Traffic sub-area division expert system for urban traffic control," in Proceedings of the International Conference on Intelligent Computation Technology and Automation (ICICTA '08), pp. 589–593, Hunan, China, October 2008.

[27] K. Lu, J.-M. Xu, S.-J. Zheng, and S.-M. Wang, "Research on fast dynamic division method of coordinated control subarea," Acta Automatica Sinica, vol. 38, no. 2, pp. 279–287, 2012.

[28] K. Lu, J.-M. Xu, and S.-J. Zheng, "Correlation degree analysis of neighboring intersections and its application," Journal of South China University of Technology, vol. 37, no. 11, pp. 37–42, 2009.

[29] H. Guo, J. Cheng, Q. Peng, C. Zhu, and Y. Mu, "Dynamic division of traffic control sub-area methods based on the similarity of adjacent intersections," in Proceedings of the IEEE 17th International Conference on Intelligent Transportation Systems (ITSC '14), pp. 2208–2213, Qingdao, China, October 2014.

[30] C. Li, Y. Xie, H. Zhang, and X.-L. Yan, "Dynamic division about traffic control sub-area based on back propagation neural network," in Proceedings of the 2nd International Conference on Intelligent Human-Machine Systems and Cybernetics (IHMSC '10), pp. 22–25, Nanjing, China, August 2010.

[31] Z. Zhou, S. Lin, and Y. Xi, "A fast network partition method for large-scale urban traffic networks," Journal of Control Theory and Applications, vol. 11, no. 3, pp. 359–366, 2013.

[32] J. MacQueen, "Some methods for classification and analysis of multivariate observations," in Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, pp. 281–297, University of California Press, Berkeley, Calif, USA, 1967.

[33] A. K. Jain, "Data clustering: 50 years beyond K-means," Pattern Recognition Letters, vol. 31, no. 8, pp. 651–666, 2010.

[34] X. Wu, V. Kumar, J. R. Quinlan et al., "Top 10 algorithms in data mining," Knowledge and Information Systems, vol. 14, no. 1, pp. 1–37, 2008.

[35] W. Zou, Y. Zhu, H. Chen, and X. Sui, "A clustering approach using cooperative artificial bee colony algorithm," Discrete Dynamics in Nature and Society, vol. 2010, Article ID 459796, 16 pages, 2010.

[36] D. T. Pham, S. S. Dimov, and C. D. Nguyen, "An incremental K-means algorithm," Proceedings of the Institution of Mechanical Engineers, Part C: Journal of Mechanical Engineering Science, vol. 218, no. 7, pp. 783–794, 2004.

[37] D. T. Pham, S. S. Dimov, and C. D. Nguyen, "A two-phase K-means algorithm for large datasets," Journal of Mechanical Engineering Science, vol. 218, no. 10, pp. 1269–1273, 2004.

[38] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, "The Hadoop distributed file system," in Proceedings of the 26th Symposium on Mass Storage Systems and Technologies (MSST '10), pp. 1–10, IEEE, Incline Village, Nev, USA, May 2010.

[39] A. Ene, S. Im, and B. Moseley, "Fast clustering using MapReduce," in Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '11), pp. 681–689, San Diego, Calif, USA, August 2011.

[40] T. White, Hadoop: The Definitive Guide, O'Reilly Media, Sebastopol, Calif, USA, 3rd edition, 2012.

[41] D. Xia, Z. Rong, Y. Zhou, Y. Li, Y. Shen, and Z. Zhang, "A novel parallel algorithm for frequent itemsets mining in massive small files datasets," ICIC Express Letters, Part B: Applications, vol. 5, no. 2, pp. 459–466, 2014.

[42] D. Xia, Z. Rong, Y. Zhou, B. Wang, Y. Li, and Z. Zhang, "Discovery and analysis of usage data based on Hadoop for personalized information access," in Proceedings of the IEEE 16th International Conference on Computational Science and Engineering - Big Data Science and Engineering (CSE-BDSE '13), pp. 917–924, IEEE, Sydney, Australia, December 2013.

[43] D. Xia, B. Wang, Z. Rong, Y. Li, and Z. Zhang, "Effective methods and strategies for massive small files processing based on Hadoop," ICIC Express Letters, vol. 8, no. 7, pp. 1935–1941, 2014.

[44] S. Ghemawat, H. Gobioff, and S.-T. Leung, "The Google file system," in Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP '03), pp. 29–43, Bolton Landing, NY, USA, October 2003.

[45] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.

[46] P. Zikopoulos, C. Eaton, D. deRoos, T. Deutsch, and G. Lapis, Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data, McGraw-Hill, New York, NY, USA, 2011.

[47] W. K. D. Pun and A. B. M. S. Ali, "Unique distance measure approach for K-means (UDMA-Km) clustering algorithm," in Proceedings of the IEEE Region 10 Conference (TENCON '07), pp. 1–4, IEEE, Taipei, Taiwan, November 2007.

[48] A. M. Fahim, A. M. Salem, F. A. Torkey, and M. A. Ramadan, "An efficient enhanced K-means clustering algorithm," Journal of Zhejiang University SCIENCE A, vol. 7, no. 10, pp. 1626–1633, 2006.

[49] M. Zhu, Data Mining, University of Science and Technology of China Press, 2002.

[50] L. Kaufman and P. J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, John Wiley & Sons, 1990.

[51] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proceedings of the 26th Annual Conference on Neural Information Processing Systems (NIPS '12), pp. 1097–1105, Lake Tahoe, Nev, USA, December 2012.

[52] S. Englert, J. Gray, T. Kocher, and P. Shah, "A benchmark of NonStop SQL release 2 demonstrating near-linear speedup and scaleup on large databases," ACM SIGMETRICS Performance Evaluation Review, vol. 18, no. 1, pp. 245–246, 1990.

[53] X. Xu, J. Jäger, and H.-P. Kriegel, "A fast parallel clustering algorithm for large spatial databases," Data Mining and Knowledge Discovery, vol. 3, no. 3, pp. 263–290, 1999.
