Road Recognition using Coarse-grained Vehicular Traces · accuracy. When there are 2,000 taxis and...

Road Recognition using Coarse-grained Vehicular Traces

Xuemei Liu, Yanmin Zhu, Yin Wang, George Forman, Lionel M. Ni, Yu Fang, Minglu Li

HP Laboratories

HPL-2012-26

Keyword(s):

map building; GPS; road recognition

Abstract:

With more and more vehicles equipped with GPS tracking devices, there is increasing interest in building

and updating maps using vehicular GPS traces. Most existing approaches for building maps rely on

position traces from highly accurate positioning devices, which are sampled at a high frequency, e.g.,

every second. Typically these traces are recorded by survey vehicles. Commodity GPS devices are much

more widespread, but have lower accuracy. In addition, the sampling frequency is low (at around once per

minute) in order to reduce communication cost. Building maps from coarse-grained vehicular GPS traces

is challenging due to the inherent noise in commodity GPS devices and the shape complexity of urban

roads. In this paper, we propose a novel algorithm for recognizing urban roads with coarsegrained GPS

traces from probe vehicles moving in urban areas. The algorithm overcomes the challenges by pruning

low quality GPS traces, clustering GPS traces by road segments, and applying shape-aware B-spline

fitting. We have conducted empirical study with a real data set of GPS traces from 2,300 taxis in

Shanghai, China. Evaluation results demonstrate that our algorithm provides wide coverage, a low rate of

false positives, and high accuracy. When there are 2,000 taxis and the time window for trace collection is

1.5 hours, the coverage of arterial roads is 93% and the rate of false positives is 5%. The roads recognized

by our algorithm are more accurate than the roads on OpenStreetMap, a popular map editing website

using GPS traces and satellite imagery.

External Posting Date: February 21, 2012 [Fulltext] Approved for External Publication Internal Posting Date: February 21, 2012 [Fulltext]

Copyright 2012 Hewlett-Packard Development Company, L.P.

Road Recognition using Coarse-grained Vehicular Traces

Xuemei Liu1, Yanmin Zhu1, Yin Wang2, George Forman2, Lionel M. Ni3, Yu Fang1, Minglu Li11Shanghai Jiao Tong University; 2HP Labs, Palo Alto; 3Hong Kong University of Science and Technology

1{xuemeiliu, yzhu, yufang, mlli}@sjtu.edu.cn; 2{yin.wang, george.forman}@hp.com; [email protected]

Abstract—With more and more vehicles equipped with GPStracking devices, there is increasing interest in building and up-dating maps using vehicular GPS traces. Most existing approachesfor building maps rely on position traces from highly accuratepositioning devices, which are sampled at a high frequency,e.g., every second. Typically these traces are recorded by surveyvehicles. Commodity GPS devices are much more widespread, buthave lower accuracy. In addition, the sampling frequency is low(at around once per minute) in order to reduce communicationcost. Building maps from coarse-grained vehicular GPS traces ischallenging due to the inherent noise in commodity GPS devicesand the shape complexity of urban roads. In this paper, wepropose a novel algorithm for recognizing urban roads with coarse-grained GPS traces from probe vehicles moving in urban areas.The algorithm overcomes the challenges by pruning low qualityGPS traces, clustering GPS traces by road segments, and applyingshape-aware B-spline fitting. We have conducted empirical studywith a real data set of GPS traces from 2,300 taxis in Shanghai,China. Evaluation results demonstrate that our algorithm provideswide coverage, a low rate of false positives, and high accuracy.When there are 2,000 taxis and the time window for trace collectionis 1.5 hours, the coverage of arterial roads is 93% and the rate offalse positives is 5%. The roads recognized by our algorithm aremore accurate than the roads on OpenStreetMap, a popular mapediting website using GPS traces and satellite imagery.

I. INTRODUCTION

Accurate and updated road maps are critical for route planningand vehicle navigation. Most existing road maps are obtainedfrom geological survey, which is an expensive and time-consuming process. These maps become frequently outdated infast growing regions. Even in developed countries with relativelystable road networks, road reconfiguration or closure due toaccident or maintenance is frequent. Lack of updating roadshas caused fatal accidents with even experienced drivers [2].

With an ever increasing number of vehicles equipped withdedicated GPS tracking devices or GPS-enabled phones, e.g.,taxis [17], buses [4], commercial and utility vehicles, vehicularGPS traces are becoming pervasively available. GPS trackingdevices installed in vehicles periodically report their measure-ments, including position coordinates, speed and heading direc-tion. These measurements can be conveniently gathered in realtime through ubiquitous cellular wireless networks. Thus, thereis considerable interest in building and updating maps usingvehicular GPS traces [7, 9, 19, 22, 25].

There have been a few existing algorithms for building mapswith position traces. However, most existing map buildingalgorithms assume high GPS sampling rate [7, 19], typicallyat 1Hz and the position samples are highly accurate. Theseposition traces are usually recorded by special survey vehicles.In practice, however, commodity GPS devices typically havemuch lower accuracy, at around tens of meters in urban areaswith lots of skyscrapers. In addition, the sampling frequencyis relatively low (around once per minute) in order to reducecommunication cost.

Building maps from coarse-grained vehicular GPS traces ischallenging due to the inherent noise in commodity GPS devices

and low sampling rate. In this paper, we propose to exploitcoarse-grained vehicular GPS traces for road recognition in anurban area. GPS records from vehicles moving in the urbanarea are collected through a cellular network to a central server.A large sampling interval is used to reduce communicationcost, ranging from 15 seconds to 5 minutes [21]. At thesesampling intervals, our experiments with real taxi traces showthat adjacent GPS samples often come from different roads, andthe trajectories constructed from samples bare little similaritywith the road network geometry.

In addition, we also find that urban road shapes are complex.Our study with the road network in Shanghai shows that around20% of the urban roads have a degree of curve larger than45∘. This indicates that a considerable percentage of roads arenot simply straight. This considerably increases the difficultyto compute roads from a set of discrete and noisy samples.Previous work usually handles roads with shapes of relativelystraight lines [7, 9, 22, 25].

In this paper, we propose a novel algorithm for recognizingurban roads using coarse-grained GPS traces. Exploiting auxil-iary information in the GPS record such as heading direction,our algorithm overcomes the challenges by pruning low qualityGPS traces, clustering traces by road segments and applyingshape-aware B-spline fitting to each cluster. For a taxi, consec-utive samples in close vicinity and bearing similar directionsare considered to be generated from the same road segment.Trajectories residing on the same road but from different taxisare grouped together using geometrical algorithms. Finally, weapply shape-aware fitting to find the road centerline. In thispaper, we explore road recognition, which is the key buildingblock for building routable maps. Recognition of intersectionsand detection of lanes [9, 10] are beyond the scope of thispaper.

We employ a large-scale data set of GPS traces from taxis inShanghai for our empirical study. For the purpose of vehicle andtraffic regulation, e.g., speed violation detection, the ShanghaiTransport Authority collects the GPS traces of the taxis. Our dataset extracts 2,300 taxis from the trace database for one weekin 2007. Our empirical results demonstrate that the algorithmprovides wide coverage, a low rate of false positives, and highaccuracy. When there are 2,000 taxis and the time window fordata collection is 1.5 hours, the coverage of arterial roads is 93%and the rate of false positives is 5%. The roads recognized by ouralgorithm are more accurate than the roads on OpenStreetMap(OSM) [12], the largest open map community.

The contributions of this paper are the following.∙ We propose a road recognition framework using perva-

sively available vehicular GPS traces. To the best of ourknowledge, this is the first attempt to recognizing roadswith low-sampling-rate vehicular GPS traces.

∙ We design a road recognition algorithm that overcomes thechallenges of inherent noisy GPS positions and complexurban road shapes, by pruning abnormal traces, grouping

2

Probe

Vehicle

Internet

ServerUser

Figure 1: The architecture of the system for recognizing urbanroads with vehicular traces.

relevant traces and applying shape-aware fitting.∙ We have conducted empirical evaluation with a real data

set of GPS traces from 2,300 taxis in Shanghai, China.Evaluation results show that our algorithm provides widecoverage, a low rate of false positives, and high accuracy.

The rest of this paper is organized as follows. The nextsection presents the system overview. Section III analyzes thereal data set of taxi GPS traces and shows the characteristicsof urban roads. Section IV describes our road recognition algo-rithm. Section V evaluates the algorithm empirically. Section VIsummarizes related work, and Section VII concludes the paper.

II. OVERVIEW

This section presents the system overview and then statesthe problem of road recognition with coarse-grained vehicularGPS traces.

A. Architecture

The architecture of the system for recognizing roads withvehicular GPS traces is illustrated in Figure 1. There are Nprobe vehicles in the system, denoted by N = {0, 1, ..., ∣N ∣ −1}. Each GPS-enabled vehicle moves at free will in the roadnetwork and generates GPS samples periodically. The samplefrom vehicle i at time t is denoted by ri(t) :< ai(t), pi(t) >,representing its instantaneous direction of headway, and position(longitude and latitude), respectively. The sample is deliveredto a central server through a data channel of a cellular wirelessnetwork .

Let Ti denote the set of timestamps at which vehicle i reportsits states, Ti = {ti0, ti1, ..., tik−1}, in which ti0 and tik−1 arethe first and the last timestamp, respectively. Thus, vehiclei generates a report set between Ta and Tb, Ri(Ta, Tb) ={ri(t)∣t ∈ Ti, Ta < t < Tb}. The sampling time of differentvehicles is not synchronized, and the sampling interval of thesame vehicle is not a constant [21].

The server collects all GPS samples from the set of all vehi-cles and applies the road recognition algorithm to be detailed inSection IV. The recognized roads can be made available onlinefor public access, e.g., updating open-source maps [12].

B. Road Recognition Problem

Our system constructs and updates maps from coarse-grainedvehicular GPS traces generated by probe vehicles moving inurban areas. A complete routable map used in navigation deviceswould contain geometry, lane configuration, speed limit, turnrestriction, road type information, etc. Our current work focuses

on the recognition of urban road geometry, which is of firstpriority in building routable maps.

The system keeps receiving GPS samples from probe vehi-cles. The set of of GPS traces collected from T0 to Tnow isdenoted as Ω(T0, Tnow) =

∑Ri(T0, Tnow), i ∈ N . We want

to recognize road geometry by using the set of all receivedvehicular traces, Ω(T0, Tnow).

For road recongnition, we consider the following majorperformance metrics. First, high coverage is desirable. For givenΩ(T0, Tnow), we want to recognize as many roads as possible.Second, the system should produce a low rate of false positives.Third, it is also desirable to gain high accuracy. Three aspectsare defined to measure accuracy, e.g., horizonal shift, verticalshift, and separation distance between the recognized roads andthe real roads. The precise definitions of these metrics are givenin Section V.

It is apparent that with longer time for collecting vehicularGPS traces, we have a larger data set. Then, it is intuitive that itbecomes easier to recognize urban roads. In practice, however,it is desirable to build and update maps in shorter time as thisallows the update of maps to be quicker. This paper assumesa fixed set of vehicular traces collected within a certain timeduration. In empirical evaluation, we will study the impact oflength of time window, T0 to Tnow, on the performance of roadrecognition.

III. ANALYSIS OF TAXI TRACES AND URBAN ROADS

In this section we first analyze the real data set of taxi GPStraces collected from around two thousands taxis in Shang-hai, China, and show that commodity GPS devices produceinaccurate positions and uneven distribution of samples. Then,we investigate the degree of curve of urban roads. Finally, wedemonstrate that the roads on the OSM Map are less inaccurate.

A. Taxi GPS Trace

Our data set are extracted from a taxi GPS trace databasemaintained by Shanghai Transport Authority. The data set spansfrom March 2006 to November 2007. As the taxi trackingsystem itself is evolving, the set of taxis contained in our data setis not fixed. The total number of taxis in our data set ranges fromtwo thousand to seven thousand each day, and the characteristicsof each taxi trace is different due to data aggregation fromdifferent taxi companies. The detailed analysis of our taxi GPSdata is available in [21], albeit using a different time period. Inthis study, we use the data set from 02/18 to 02/24 in 2007. Wesummarize the statistics of the data set below.

1) Number of Taxis: There are totally 4,380 taxis in our dataset. However, exactly 2,000 of them are always reporting a fixedposition. These taxis reported non-stationary GPS traces in thepast (corresponding to SH-A group in [21]). The position reportcould have been turned off accidentally or intentionally.

2) Sampling Interval: For the purpose of taxi dispatching,the sampling interval when occupied can be longer than thatwhen vacant. Furthermore, a status change can trigger a sampleany time, and the transmission is not reliable. Nevertheless, thesampling intervals of most taxis are very prominent, and canbe detected using a simple threshold [21]. Among all taxis thatreport non-stationary GPS coordinates, 1,855 samples at 16 sec-ond when vacant and 61 second when occupied (correspondingto SH-B group in [21]); 430 samples at a fixed interval 60-61sec.

3

Figure 2: Distribution of vehicular GPS samples and theirheading directions.

0 200 400 600 8000

800

1600

2400

3200

4000

num

ber

of f

ootp

rint

s

distance to the nearest road (m)0 200 400 600 800

0

0.2

0.4

0.6

0.8

1

CD

F

Figure 3: Distance of samples to nearest roads.

3) Resolution of GPS Positions: The GPS coordinates re-ported include four fraction digits, i.e., 0.0001∘, which is 8.5 or11.1 meters along latitudinal or longitudinal line in Shanghai, re-spectively. Thus, the position resolution is

√( 8.52 )2 + ( 11.12 )2 ≈

7 meters, denoted by D.4) Noisy GPS Positions in Urban Area: Besides low GPS

resolution, our study with taxi GPS traces also shows thatpositions of commodity GPS are noisy and inaccurate [20, 21].In [21], it is pointed out that some consecutive samples with astraight-line speed exceeding 600 km/h. Multipath error insidea city canyon surrounded by high-rise buildings is a main causefor GPS noise, and we found samples rather randomly dis-tributed under city canyon effect. Figure 3 shows the distributionof the distance between samples and nearest roads. We findnearly 20% of samples more than 100 meters away from anyroad.

5) Heading Direction: The heading directions in our GPSrecords are coarse-grained, containing only eight cardinal direc-tions. Thus the resolution of vehicle heading direction is 22.5∘.Figure 2 also displays the moving direction of each GPS sampleas a short tail before the dot.

!

Figure 4: A polyline and its degree of curve µ.

0 .5 1 1.5 2 2.5 3 3.5 4+10

0

101

102

103

num

ber

of r

oad

segm

ents

(a) length of road segments (km)

0 45 90 135 180+10

0

101

102

103

104

num

ber

of r

oad

segm

ents

(b) degree of curve of road segment (degree)

Figure 5: Characteristics of road segments in Shanghai, China.

B. Characteristics of Urban Roads

For the purpose of verification, we use OSM as a refer-ence [1]. Two sources contributed to the majority of roads inShanghai. Arterial roads are imported from a commercial mapdonated by Automotive Navigation Data Inc. Local roads aremostly manually created based on aerial imagery.

OSM represents roads as polylines, i.e., a sequence of straightlines. Each named road may be divided into multiple seg-ments by intersections. Each road segment is represented bya polyline. As illustrated in Figure 4, we denote a polylineas P = (p1, p2, ..., pk). As of June 25th, 2011, there aretotally 14,473 polylines representing road segments within theboundary of Shanghai. Figure 5a shows the length histogram ofroad segments. Among all 14,473 road segments, 12,160 (84%)are shorter than 1 km.

Importantly, we investigate the complexity of urban roadshapes. To this end, we define the degree of curve of a roadsegment as the maximum degree that a vehicle has to turnclockwise or counterclockwise when driving through the roadsegment, as illustrated in Figure 4. Figure 5b shows the degreeof curve of road segments in Shanghai. We can find that only4,967 (34%) road segments have a degree of curve smaller thanone degree, and around 2,895 (20%) urban road segments havea degree of curve larger than 45∘. This clearly shows that theshapes of urban roads can be complex, not simply straight.

C. Inaccuracy of OSM Maps

We find that the roads on the OSM map inconsistent withour taxi GPS traces. Figure 6 is a prominent example. For thepurpose of illustration, we connect two consecutive sampleswhen the interval is less or equal to 30 seconds. The coordinatesof the street is consistent with the aerial image on OSM andtherefore disagree with the GPS trace data. This inconsistencyis likely a result of image distortion because different positionsof the map exhibit varying degrees of map shift. Furthermore,the coordinates of some arterial roads are inconsistent with theaerial image but agree with our GPS data. The detailed analysisis available in [21]. Finally, there are many missing roads andnewly constructed roads in the map which do not have any GPS

4

Figure 6: Map inconsistency (white dots are samples).

traces. This clearly suggests that it is necessary to recognize andupdate roads with vehicular GPS traces.

IV. DESIGN OF ROAD RECOGNITION ALGORITHM

The goal of the algorithm is to recognize as many roads aspossible based on GPS traces Ω(T0, Tnow) generated by probevehicles from T0 to Tnow, with high accuracy and low rate offalse positives.

The design of the algorithm faces several challenges. First, theunderlying road shapes are complex as introduced in SectionIII-B. Second, there are inherent errors associated with bothGPS position and heading direction. As shown in Figure 2, aportion of the samples deviate from the roads or their headingdirections do not agree with the road orientation strictly. Third,GPS samples are far from a uniform distribution along theroads. As a consequence, many road segments are covered byonly a few samples. Thus, it is intuitively difficult to recognizesuch roads. Fourth, we can connect the consecutive samplesof a vehicle and get a continuous trajectory. Although suchtrajectories can be useful to recognize roads with just a fewsamples, they also bring negative influence. In Figure 7, we cansee that a considerable number of the raw trajectories are messyand provide no useful information for road recognition. This isbecause these raw trajectories are not the actual drive paths ofthe vehicles.

A. Basic Idea

From Figure 7, we can see that underlying road geometrycan be roughly shaped by the distribution of vehicle trajectoriesin spite of existing messy raw trajectory portions, and vehicletrajectories can provide more information for recognizing roadsthan samples alone. Thus, we develop a novel technique forrecognizing roads from error-prone samples. The basic ideais to first cluster samples belonging to the same road basedon trajectory similarity and then apply a fitting procedure tocompute the centerline of the road.

The first key issue with the technique is to determine the unitof clustering. There is no obvious such unit. In order to applycurve fitting, the important requirement is that the target curveto be fitted should be a function (i.e., one to one mapping). Itshould be stressed here that it is sufficient that the target curverepresents a function if there exists a transformation from theoriginal coordinate system to a new coordinate system. Thisis because we can perform curve fitting in the new coordinatesystem and then convert the obtained curve back to a curvein the original coordinate system. Therefore, this gives us theprinciple for clustering samples: to cluster samples in such away that the target curve represents a function.

The second key issue is on the number of clusters that wecan generate. It is difficult to determine the optimal number

of clusters given the set of maps and the target map of roads.However, the number of clusters should be minimized so thatthe number of resulting fitted curves would be minimized. Thisalso suggests that the computed roads would be smoother andcloser to the real roads.

Our technique consists of three main steps. In Step 1, wefirst generate one trajectory for each vehicle by connecting itsconsecutive samples. Then, those abnormal portions of trajec-tories are pruned. In Step 2, we cluster samples by groupingthe vehicle trajectory from the same road into one cluster. InStep 3, a shape-aware fitting method is employed to generatethe centerline of each road.

For clarity of presentation, we give some notations. Let fi andfi+1 denote two consecutive samples from a vehicle and fi+1 isgenerated later than fi, and o(fi) denote the heading direction ofsample fi. Let L(fi, fi+1) denote the line segment connectingthe two samples, and o(L(fi, fi+1)) denote the orientation of theline segment. Both pruning samples and basic clustering will beexecuted based on line segments of trajectories.

B. Data PruningTo eliminate negative influence of messy trajectory portions,

we have to identify and prune them. We propose two pruningcriteria for identification of messy trajectory portions.

Direction fitness criterion: A line segment is useful to roadrecognition only if the segment approximates the real drivepath of the vehicle. Example line segments are L2 and L3

in Figure 10. If the two line segments are from two differentroad segments, it is probable that the line segment does not fitthe underlying roads. Conversely, it goes across the roads, asillustrated by L1 and L4 in Figure 10. Such line segments areuseless or even harmful to accurate road recognition. Based onthis observation, we design the first pruning criterion. Considertwo samples fi and fi+1. L(fi, fi+1) should be pruned if thefollowing condition holds,

∣o(fk)− o(L(fi, fi+1))∣ > Θ, k = i, i+ 1. (1)

where Θ is the resolution of vehicle heading direction shown inSection III-A5.

Speed limit criterion: It is reasonable that there is an upperlimit on vehicle driving speed. If the average speed between twoconsecutive samples exceeds the upper limit, such line segmentmay contain errors and therefore should be pruned. In ourexperiments, speed upper limit º is set as the maximum speedlimit of express roads (e.g., 120km/ℎ in the case of Shanghai.)

We apply the two criteria to prune the trajectory portionsshown in Figure 7, and the result is shown in Figure 8. We canfind that most of the messy portions of the trajectories have beenpruned and remaining trajectories are more closely distributedalong the roads.

C. Data ClusteringAfter Step 1, we obtain a set of remaining line segments that

are more closely distributed along the roads. Next, we clusterGPS samples based on the remaining line segments in Step 2.There are two substeps. In the first substep, the basic clusteringprocedure is applied. As a result, the samples are first groupedinto clusters. However, the samples of a resulting cluster maycontain more than one curve and thus cannot be accepted bycurve fitting. Thus, in the second substep, we apply enhancedclustering which inspects each cluster and further divides aproblematic cluster into smaller clusters. As a result, each clustercorresponds to a curve that represents a function.

5

Figure 7: Vehicle trajectories by con-necting consecutive GPS samples fromeach vehicle.

Figure 8: The line segments after prun-ing shown on the map.

Figure 9: Clusters of line segments af-ter clustering. Different clusters are indifferent colors.

Figure 12: Problem with basic clustering: Line segments oftwo different roads with a small separation angle are put inthe same cluster.

Figure 13: Effect illustration of enhanced clustering.

Figure 10: Different pair casesof two consecutive samples.L2 and L3 are kept; L1 andL4 are pruned.

2l

1l

Figure 11: Illustration of max-imum separation angle definedby two line segments, L1 andL2.

1) Basic Clustering: The processes of the basic clusteringalgorithm are as follows. Each line segment itself is initializedto be a cluster. Later, clusters are merged according to a criterionconsidering two metrics. If two line segments are close to eachother and share a similar orientation, they are likely from thesame road and the two clusters they belong to should be merged.

The two metrics are defined as follows.Orientation similarity µ: The orientation similarity µ between

two line segments, L(fi, fi+1) and L(fj , fj+1) (simplified asL1, L2), is defined to be the separation angle of the two linesegments. Thus, µ can be computed as,

µ (L1, L2) ≜ acos(∣−−−−−−→fi+1 − fi ⋅ −−−−−−→fj+1 − fj ∣

d(L1) d(L2)). (2)

Geographical distance ±: The geographical distance ±between two line segments L1 and L2, denoted as ± (L1, L2), isthe shortest distance between them. It is zero if they intersect.We use the algorithm in [18] to calculate ±.

With the availability of the two metrics, we next give theclustering criterion. Consider two clusters, A and B. The twoclusters should be merged if ∃L1 ∈ A, and ∃L2 ∈ B, such that

µ (L1, L2) ≤ µmax (L1, L2) and

± (L1, L2) ≤ ±max (3)

We next explain how to determine the two parameters,µmax (L1, L2) and ±max. It is intuitive that µmax (L1, L2) shouldbe smaller than the maximum that may happen to the two linesegments belonging to the same road. The worst case occurswhen two endpoints of each line segment are separated by theroad and the two line segments cross each other, as illustratedin Figure 11. Let W denote the width of the road. Then,µmax (L1, L2) can be determined by,

µmax (L1, L2) = arcsinD + W

2d(L1)

2

+ arcsinD + W

2d(L2)

2

. (4)

where D is the position resolution of one GPS sample intro-duced in Section III-A3. µmax (L1, L2) does not work when theline segments are short, i.e., two samples are close, and the twoline segments reside in a road intersection. In this case, even ifthe two line segments belong to two different roads, they maystill be clustered together. To address this issue, we introduce asystem-wide threshold µ0. By experimental study, we set µ0 to15 ∘.±max should be the maximum that happens when the two line

segments lie to different sides of the road, respectively. Thus,

±max = 2D +W (5)

Note that the width of roads may differ from each other, andthe information is not available. For simplification, we select Was the width of representative roads, i.e., 30 meters.

We run the basic clustering algorithm on the line segmentsshown in Figure 8 and the result is shown in Figure 9. We can

6

find that the whole set of line segments are divided into differentclusters and a cluster of line segments roughly represents a road.

2) Clustering Refinement: Basic clustering works well forthose roads that are orthogonal to each other. However, it failswhen the separation angle between two roads is relatively small.Two line segments of such two roads may be put into the samecluster by basic clustering. Consequently, a resulting cluster mayactually contain several different roads. This problem with basicclustering is illustrated in Figure 12.

To overcome this serious issue, we propose clustering refine-ment which inspects each cluster produced by basic clusteringand check if there are multiple roads in the cluster. If yes,it further divides the cluster into smaller clusters, so thateach small cluster contains one simple curve that represents afunction.

The central idea of clustering refinement is to probe forbackbone curves representing roads and group the samples closeto each curve into a smaller cluster. The two main steps ofclustering refinements are as follows.

The first step is constructing a polyline to approximate abackbone curve for cluster A. It initializes by searching thesamples in A, and get the westernmost sample if A spanswider in latitude, or the southernmost footprint if A spanswider in longitude. The selected sample is treated as thestarting point of the polyline. Then, the polyline is incrementallyextended by searching for a new qualified sample each time. LetP = (p1, p2, ..., pk) denote the current constructed polyline.We select the closest sample f as pk+1 to extend the polylinefrom point pk, such that

d(f, pk) > dstep and

µ(−−−−−−→pk−1 − pk,

−−−−→pk − f) < " (6)

The process terminates when no more sample can be foundto extend the polyline. Suppose the constructed polyline isP = (p1, p2, ..., pK).

The second step is grouping the samples close to this polylineinto a new cluster A1. Specifically, ∀fi ∈ A should be put intoA1 if the following condition holds.

Minj∈[1,K−1]

d(fi, L(pj , pj+1)) < 2D +W (7)

Then, A = A − A1. Cluster refinement repeats the two stepsiteratively until no more sample exists in A. Suppose A is furtherdivided into k small clusters. Then, we have the following

A =

k∪

i=1

Ai. (8)

By experimental study, we set ¼ = 100m, " = 15∘, and X =5. Figure 13 shows the effect of applying clustering refinementto the cluster shown in Figure 12. We can see that the clusteris further divided into small clusters and each cluster containsa backbone curve that represents a function.

D. Fitting Roads

In Step 3, we compute the centerline of a road by applyingcurve fitting on each cluster derived in Step 2. More specifically,we fit a curve by the samples in a cluster. The computed curveis used to estimate the underlying road segment.

There are many different curve-fitting methods. It is importantto select a good one for road recognition. Since roads canbe in very different shapes, we employ B-spline fitting [19],

121.486 121.488 121.49 121.492

31.255

31.256

31.257

logtitude

latit

ude

footprints1 control point/arch6 control points/arch3 control points/archroad in OpenStreetMap

Figure 14: Samples and centerlines computed with varyingnumber of control points per arch.

which treats the target curve as a smooth piecewise-polynomialfunction.

The key issue in applying B-spline fitting on a given clusteris to determine °, the number of control points of the B-spline.The number of control points strongly influences the shape ofthe fitted B-spline. In [19], the number of control points is setto be 2m

k , in which m is the number of sample points in thecluster and k is a small factor. It is not explained how to setk. To obtain a good fit, however, the number of control pointsshould be set according to the shape of the underlying roadsegment rather than the number of sample points. Too small ortoo large a number of control points may lead to poor fitting.

As illustrated in Figure 14, the selected road in OSM containstwo arcs. By comparison, we find that 3 control points per archoutperforms both 1 or 6 control points per arch. 3 control points(1 per arch) for the road lack goodness of fitting, while 12control points (6 per arch) suffer the problem of over fitting.

Based on this important observation, we propose shape-awarefitting, i.e., to choose the number of control points according tothe number of arches contained in the target curve, ° = 3× e,where e denotes the number of arches contained.

Since we have no access to the target curve (the underlyingroad), it is nontrivial to determine the number of arches given aset of samples. We develop an algorithm to derive this number.The central idea of this algorithm is to find the places where thetarget curve changes heading direction. Note that if ∣Π∣ < ',there will be no fitted road segment for the sample set. In ourexperiment we set Γ = ' = 5, L = 200m.

V. EXPERIMENTS AND EVALUATION

In this section, we first present the experimental setup, andthen define three metrics for performance evaluation. After that,we present and discuss the evaluation results.

A. Methodology and Experimental Setup

In order to conduct empirical performance study, we use thereal data set of taxis’ GPS traces and the Shanghai digital mapobtained from OSM [1]. The elements in the digital map weuse include polylines representing road segments, along with theroad types of road segments. There are mainly three road types,i.e., arterial, secondary main, and branch. We will evaluate ourroad recognition algorithms on different road types. We considerthe road segments in a selected area of 14.5km×14km as shownin Figure 15 for evaluation. In Figure 15 the black dashed lines

7

Algorithm of determining the number of archesInput:

Π: the cluster of samples;': minimum support in Π;L: the span of one section of a cluster;Γ: minimum support in one section;

Output:!: The number of arches in Π;

Main Procedure:1. if ∣Π∣ < '2. return 0; // No Fitted road will be produced in this case.3. end if4. if samples in Π span wider in latitude.5. ¹0 = the westernmost sample;6. else7. ¹0 = the southernmost sample;8. end if9. S1 = {the closest Γ samples in Π to ¹0};

10. S1 = S1∪ {p∣p ∈ Π, d(p, ¹0) ≤ L};

11. ¹1 = center point of all samples in S1;12. Π = Π− S1;13. k = 1;14. while Π ∕= NULL15. k ++;16. Sk = {the closest Γ samples in Π to ¹k−1};17. Sk = Sk

∪ {p ∣ p ∈ Π, d(p, ¹k−1) ≤ 32L};

18. ¹k = center point of all samples in Sk;19. Π = Π− Sk;20. end while21. Generate vectors, −→Ài = ¹i+1 − ¹i, i ∈ [1, k − 1];22. ! = 1;23. for i = 2 : k − 224. if (−−→Ài−1 ×−→Ài) ⋅ (−→Ài ×−−→Ài+1) > 025. ! ++; // new arch appears26. end if27. end for28. return !;

end

Figure 15: Roads recognized by our algorithm (solid lines) androads on the OSM map (dashed lines).

are the roads in OSM and the red ones are those recognizedby our algorithm. There are 440 arterial road segments and thetotal length of these road segments is 106.3km. There are 1325secondary main road segments and the total length is 348.7km.There are 97 branch road segments and the total length is23.1km.

For performance evaluation, we design three metrics: cov-erage, false positive rate and accuracy. In order to giveprecise definitions, we introduce some notations. Let Φ ={P1, P2, ..., Pk} denote the set of all road segments in OSM,where Pi is a polyline representing one road segment. Let

Ψ denote the set of all recognized road segments. We nextlabel roads successfully recognized by matching the polylinesin Φ with Ψ. Let Φs denote the set of polylines representingmatched road segments in Φ, and Ψs denote the set of polylinesrepresenting matched road segments in Ψ. Next, we give thederivation of Φs and Ψs. Suppose polyline P1 ⊆ P ′

1 ∈ Φ andpolyline P2 ⊆ P ′

2 ∈ Ψ. P1 ∈ Φs and P2 ∈ Ψs if the followingtwo conditions hold,

len(P1)

len(P2)= 1, (9)

dH(P1, P2) < ». (10)

where dH(P1, P2) is the Hausdorff distance between P1 and P2

[14]. In our experiments we set » = 50m.

Definition 1 (Coverage). We define coverage ½ as follows,

½ =

∑P∈Φs

len(P )∑P∈Φ len(P )

(11)

Note that if the false negative rate is higher, then the coverageis lower.

Definition 2 (False positive rate). We define the false positiverate $ as follows,

$ = 1−∑

P∈Ψslen(P )∑

P∈Ψ len(P )(12)

We evaluate accuracy in terms of separation distance, horizon-tal shift and vertical shift. Consider a pair of matched polylines,P1 ∈ Φs and P2 ∈ Ψs. The separation distance of the twopolylines is defined as,∫

p(x,y)∈P2d (p(x, y), P1) dxdy

len(P2)(13)

where p(x, y) denotes a point with latitude x and longitude y.We define the horizontal shift between P1 and P2 as follows,

∫p(x,y)∈P2,p(x′,y′)∈P1

(x′ − x)dxdy

len(P2)(14)

where d(p(x, y), p(x′, y′)) = d(p(x, y), P2). The vertical shiftcan be defined similarly.

B. Compared Algorithms

We compare our road recognition technique to an alternativealgorithm called simple clustering and fitting (SCF). In order toexplore the effectiveness of clustering refinement, we use twoversions of our technique: one with only basic clustering andthe other enhanced with clustering refinement.

∙ Simple Clustering and Fitting (SCF). As shown inFigure 2, although error exists in the heading directions ofGPS samples, we can see that most samples on the sameroad segments share similar heading directions. Thus, wecan cluster samples based on their distance and headingdirections, e.g., two samples with similar heading directionsand near to each other should be allocated into one cluster.We partition the samples of one cluster into subclusters atpositions where the target curve changes directions, andapply quadratic polynomial fitting method to get one fittedroad segment for each sub-cluster.

∙ Basic Clustering and Adaptive Fitting (BAF) . In thisalgorithm, we execute all the three steps shown in SectionIV except the clustering refinement.

8

(a) arterial roads coverage. (b) false positive. (c) separation distance.

Figure 16: Comparison of different algorithms.

∙ Clustering Refinement and Adaptive Fitting (CAF) . Inthis algorithm, we execute all the steps shown in SectionIV, including data pruning, basic clustering, clusteringrefinement, and adaptive B-spline fitting.

C. Comparative Study

We compare the three algorithms in terms of coverage, falsepositives, and separation distance and the results are shown inFigure 16.

In Figure 16a we compare the algorithms against coverage ofarterial roads as the number of taxis and time window of datacollection vary. We can see that BAF and CAF significantlyoutperform SCF, especially when the number of taxis is smalland the time window is short. In addition, the convergencespeeds for BAF and CAF are almost the same, and they aremuch faster than SCF. With N = 2000, the coverage of arterialroads reaches 93% for CAF when T = 1.5 hours, but nearlykeeps unchanged when T is longer. Importantly, CAF achievesthe best performance of coverage. When there are abundant GPSsamples, BAF suffers from the problem shown in Figure 12.

In Figure 16b, we evaluate the algorithms in terms of falsepositive rate. We can see that CAF has the lowest false positiverate and SCF performs the worst. When the number of GPSsamples in Ω(T0, Tnow) is relatively small, BAF and CAF sharesimilar rates of false positives. With increasing N and T , therates of false positives of both BAF and CAF increase slightly,mainly because of missing roads in OSM. This is further studiedin Section V-E. SCF produces higher false positive rates thanBAF and CAF.

In Figure 16c we examine the performance of the threealgorithms in terms of separation distance. We can see that bothBAF and CAF achieve much shorter separation distances thanSCF. This shows shape-aware B spline fitting performs wellfor roads with various shapes. With increasing N and T , theseparation distances converge to a reliable value.

D. Coverage

We next further investigate the performance of coverageachieved by CAF on the three road types, i.e., arterial roads,secondary main roads and branch roads, respectively. The resultsare shown in Figure 17.

As expected, arterial roads gain highest coverage, secondarymain roads follows, and branch roads get the least, becausethe opportunities for vehicles traveling on the three types roadsdecrease correspondingly. However, for a certain type roads,coverage will increase and keep nearly stable even when N gets

Figure 17: Coverage ½ for different road types.

larger or T gets longer. For example, given N = 2, 000 taxis,coverage increases to 93% when T increases from 0.5 hours to1.5 hours for arterial roads, but keeps nearly unchanged whenT is longer. The reason is that some proportion of the roads arenot covered by taxis. The same phenomenon applies to othertwo types of roads as well.

Moreover, arterial roads achieve the stable coverage valuewith less data than secondary main roads do. The reason is thatarterial roads has a larger opportunity to have vehicles runningon than secondary main roads. We do not compare branch roadshere as their total length is short shown in Section V-A, and canachieve the low stable coverage value, i.e., 41%, with a smallnumber of samples.

E. False Positive

We next investigate the false positive rate $ achieved by ouralgorithm. Figure 16b depicts the distribution of false positiverates. For CAF, we can see that the false positive rate is alwaysless than 5% even when N = 2000 and T = 1.5 hours. Inaddition, the false positive rate ascends slightly with increasingN and T .

We next show that the false positive rate computed with theOSM map is actually larger than the real false positive rate,because the OSM map is typically out of date and inconsistentwith the real roads. One such example is shown in Figure 18.Figure 18a shows all the line segments after data pruning,together with the dashed road segments shown in OSM. Fig-ure 18b shows the real roads and the roads recognized. We caneasily find that our algorithm successfully recognizes the newroads that do not appear in the OSM map. Such false positive

9

(a) Line segments together with dashed road segments in OSM.

Ro

Roads

oads reco

in OSM

ognized

Real Rooads

(b) Real roads and the recognized roads by CAF.

Figure 18: One counter example of a false positive for theproposed algorithm.

cases can be valuable to detection of new roads that have notbeen included in the outdated maps.

F. Accuracy

We finally investigate the performance of accuracy achievedby our algorithm. It is actually difficult to compute the errorof the recognized roads since no maps are truly accurate. Wefirst show the separation distance, horizonal and vertical shiftsof the roads recognized by our algorithm compared to the roadson the OSM map. Next, we show that the roads recognized byour algorithms are more accurate than the roads on the OSMmap.

Figure 19 shows the distributions of separation distance,horizontal and vertical shifts of all road segments Pi ∈ Ψs,with 1.5 hours of data from 2, 000 taxis. Figure 19c is thedistribution of separation distance. We can see that the distanceis mainly distributed in interval [0, 40]. In Figures 19a and 19b,we find that the distributions of horizontal and vertical shifts ofrecognized roads from the roads on the OSM map are mainly ininterval [−15, 5] and [−20, 5], respectively. Thus, the OSM mapshifts toward northeast, consistent with our visual inspection.

We then look at the question: which roads are more accurate?In [21], a novel method is designed for comparing real roadswith roads on OSM, with a practical assumption that GPSpositions follow the Gaussian distribution [13]. It is provedthat roads on the OSM map deviate from the real positionsof the roads, and one good example is shown in Figure 6.The distributions of horizonal and vertical shifts of real roadsfrom the roads on the OSM map are both mainly in interval[−15, 5]m. The roads recognized by our algorithm agree withthe conclusion derived in [21]. Thus, the roads recognize by ourroad recognition algorithm with corase-grained vehicular GPStraces are more accurate than the roads on the OSM map.

VI. RELATED WORK

The traditional method of constructing road maps is geo-graphical survey, which it is a time-consuming and expensive

process. Moreover, these road maps cannot be updated timelyin circumstances that new roads are constructed, existing roadsare permanently or temporarily closed. OpenStreetMap [12]is designed for editing maps by the public. A considerablenumber of volunteers are editing the maps manually based onuploaded GPS traces or aerial imagery. In [11], the authorspresent the essential step in automate map revision with satelliteimagery, matching the roads present on both images and the mapdatabase.

Map building using offline GPS trajectories with high sam-pling frequencies and high accuracy have been studied [3,7, 8, 19, 22, 25]. When the actual drive path of the GPStrace is unknown, data clustering is needed to group togethertraces that are likely from the same road [7, 19, 22, 25]. Highsampling frequencies and highly accurate GPS positions benefitthe data clustering process significantly, however, it is still verychallenging. In [19, 25], the sampling interval is 1 second to 10seconds, and the clustering is assisted by a base map. In [7],survey vehicles report one GPS report per second with highaccuracy, and gravitational and attraction forces are simulatedto cluster GPS traces. With a sampling interval around 1 minute,clustering GPS traces is extremely difficult. In our GPS traces, itis not uncommon that a taxi generates no more than one sampleper road segment. The trajectory of one vehicle bears little orno similarity with the true road geometry. In our algorithm weinnovate both Basic Clustering and Clustering Refinement to dothe work and achieve good performance.

There have been several methods for fitting points to curves.In [7], a graph generation algorithm based on geometry isproposed for extracting roads. However, it is infeasible for low-sampling-rate and error-prone GPS traces. B-spline fitting is apopular method for approximating highways from GPS traces[3, 8, 19]. In our work, we present an adaptive shape-awarefitting method which can achieve better fitting performance forroads with various shapes.

Some other existing studies infer road intersections and lanestructures [9, 10]. Our work is complementary to these existingstudies.

Recently, probe vehicles have been employed for variousmobile sensing applications [5, 6, 15, 16, 23, 24, 26]. In[5, 6], the authors propose a data management system calledCarTel for querying and collecting mobile vehicles sensory data,which is made available to application development. MobEye[15] is a protocol designed to opportunistically diffuse sensorydata summaries among mobile vehicles and to create index forquerying monitoring data. Deploying probe vehicles for roadtraffic sensing has been studied in [16, 23, 26], and compressivesensing [16] and multiple-channel singular spectrum analysis[26] have been utilized. In [24], T-Drive mines driving directionsfrom historical GPS trajectories of a large number of taxis andgives good recommendation to taxi drivers.

VII. CONCLUSION AND FUTURE WORK

With more and more vehicles equipped with commodity GPStracking devices, there is considerable interest in building mapswith vehicular GPS traces. In this paper we have presented aroad recognition algorithm using coarse-grained GPS traces. Themain challenges include inaccurate GPS positions, low samplingrates and complex road shapes. Our algorithm overcomes thechallenges by pruning low-quality samples, clustering relevantsamples and then applying shape-aware fitting. Our algorithmalso exploits heading information available in GPS data. In order

10

−50 0 500

30

60

90

120

num

ber

of r

oads

(a) Horizontal shift (m)−80 −40 0 40 80

0

30

60

90

120

num

ber

of r

oads

(b) Vertical shift (m)0 40 80

0

10

20

30

40

50

num

ber

of r

oads

(c) Separation distance (m)

Figure 19: Shifts and separation distances of recognized roads from roads on the OSM map.

to address the issue incurred by roads with small inner angles,we design a novel technique to distinguish such roads whenclustering samples. With real GPS traces from around 2,300taxis in Shanghai, our empirical evaluation demonstrates thatroads calculated by our algorithm is more accurate than themap on OSM and provides high accuracy, wide coverage, andlow false positive rate.

REFERENCES

[1] OpenStreetMap. http://www.openstreetmap.org.[2] More than 50 crashes on Bay Bridge curve. http://www.usatoday.com/

news/nation/2009-11-18-bay-bridge N.htm, 2009. USA Today.[3] D. Ben-Arieh, S. Chang, M. Rys, and G. Zhang. Geometric modeling of

highways using global positioning system data and B-Spline approxima-tion. Journal Transportation Engineering, 130(5):632–636, 2004.

[4] R. L. Bertini and S. Tantiyanugulchai. Transit buses as traffic probes:Use of geolocation data for empirical evaluation. Transportation ResearchRecord, 1870:35–45, 2004.

[5] V. Bychkovsky, K. Chen, M. Goraczko, H. Hu, B. Hull, A. Miu, E. Shih,Y. Zhang, H. Balakrishnan, and S. Madden. The CarTel mobile sensorcomputing system. In Proc. ACM SenSys, 2006.

[6] V. Bychkovsky, K. Chen, M. Goraczko, H. Hu, B. Hull, A. Miu, E. Shih,Y. Zhang, H. Balakrishnan, and S. Madden. Data management in theCarTel mobile sensor computing system. In Proc. ACM SIGMOD, 2006.

[7] L. Cao and J. Krumm. From GPS traces to a routable road map. In Proc.ACM GIS, 2009.

[8] M. Castro, L. Iglesias, R. Rodrıuez-Solano, and J. A. Sanchez. Geometricmodelling of highways using global positioning system (gps) data andspline approximation. Transportation Researchs, 14(4):233 – 243, 2006.

[9] Y. Chen and J. Krumm. Probabilistic modeling of traffic lanes from GPStraces. In ACM GIS, 2010.

[10] A. Fathi and J. Krumm. Detecting road intersections from GPS traces. InProc. 6th International Conference on Geographic Information Systems,2010.

[11] R. Fiset, F. Cavayas, M. Mouchot, B. Solaiman, and R. Desjardins. Map-image matching using a multi-layer perceptron: the case of the roadnetwork. Journal of Photogrammetry and Remote Sensing, 53(2):76–84,1998.

[12] M. M. Haklay and P. Weber. OpenStreetMap: User-generated street maps.IEEE Pervasive Computing, 7:12–18, 2008.

[13] J. Hightower and G. Borriello. Location systems for ubiquitous computing.Computer, 34(8):57–66, 2001.

[14] D. P. Huttenlocher, G. A. Klanderman, and W. Rucklidge. Comparingimages using the Hausdorff distance. IEEE Transactions on PatternAnalysis and Machine Intelligence, 15(9):850–863, 1993.

[15] U. Lee, E. Magistretti, M. Gerla, P. Bellavista, and A. Corradi. Dissem-ination and harvesting of urban data using vehicular sensing platforms.IEEE Transactions on Vehicular Technology, 58(2):882–901, 2009.

[16] Z. Li, Y. Zhu, H. Zhu, and M. Li. Compressive sensing approach to urbantraffic sensing. In Proc. IEEE ICDCS, 2011.

[17] K. Liu, T. Yamamoto, and T. Morikawa. Feasibility of using taxi dispatchsystem as probes for collecting traffic information. Journal of IntelligentTransportation Systems, 13(1):16–27, 2009.

[18] V. Lumelsky. On fast computation of distance between line segments.Information Processing Letters, 21(2):55–61, 1985.

[19] S. Schroedl, K. Wagstaff, S. Rogers, P. Langley, and C. Wilson. MiningGPS traces for map refinement. Data Mining and Knowledge Discovery,9:59–87, 2004.

[20] F. van Diggelen. GNNS accuracy: Lies, damn lies, and statistics. GPSWorld, pages 26–32, 2007.

[21] Y. Wang, Y. Zhu, Z. He, Y. Yue, and Q. Li. Challenges and opportunities inexploiting large-scale GPS probe data. Technical Report HPL-2011-109,HP Labs, 2011.

[22] S. Worrall and E. Nebot. Automated process for generating digitisedmaps through GPS data compression. In Proc. Australasian Conferenceon Robotics and Automation, 2007.

[23] J. Yoon, B. Noble, and M. Liu. Surface street traffic estimation. In Proc.ACM MobiSys, 2007.

[24] J. Yuan, Y. Zheng, C. Zhang, W. Xie, X. Xie, G. Sun, and Y. Huang.T-drive: driving directions based on taxi trajectories. In Proc. ACM GIS,2010.

[25] L. Zhang, F. Thiemann, and M. Sester. Integration of GPS traceswith road map. In Proc. 2nd International Workshop on ComputationalTransportation Science, 2010.

[26] H. Zhu, Y. Zhu, M. Li, and L. Ni. SEER: Metropolitan-scale trafficperception based on lossy sensory data. In Proc. IEEE INFOCOM, 2009.

Road Recognition using Coarse-grained Vehicular Traces · accuracy. When there are 2,000 taxis and...

Documents

Transcript of Road Recognition using Coarse-grained Vehicular Traces · accuracy. When there are 2,000 taxis and...