

Using Incomplete Information for Complete Weight Annotation of Road Networks

Journal: Transactions on Knowledge and Data Engineering

Manuscript ID: TKDE-2012-07-0548.R2

Manuscript Type: Regular

Keywords:

H.2.8.o Spatial databases and GIS < H.2.8 Database Applications < H.2 Database Management < H Information Technology and Systems, H.2.8.d Data mining < H.2.8 Database Applications < H.2 Database Management < H Information Technology and Systems, G.3.b Correlation and regression analysis < G.3 Probability and Statistics < G Mathematics of Computing


Using Incomplete Information for Complete Weight Annotation of Road Networks

Bin Yang, Manohar Kaul, Christian S. Jensen, Fellow, IEEE

Abstract—We are witnessing increasing interest in the effective use of road networks. For example, to enable effective vehicle routing, weighted-graph models of transportation networks are used, where the weight of an edge captures some cost associated with traversing the edge, e.g., greenhouse gas (GHG) emissions or travel time. It is a precondition to using a graph model for routing that all edges have weights. Weights that capture travel times and GHG emissions can be extracted from GPS trajectory data collected from the network. However, GPS trajectory data typically lack the coverage needed to assign weights to all edges. This paper formulates and addresses the problem of annotating all edges in a road network with travel cost based weights from a set of trips in the network that cover only a small fraction of the edges, each with an associated ground-truth travel cost. A general framework is proposed to solve the problem. Specifically, the problem is modeled as a regression problem and solved by minimizing a judiciously designed objective function that takes into account the topology of the road network. In particular, the use of weighted PageRank values of edges is explored for assigning appropriate weights to all edges, and the property of directional adjacency of edges is also taken into account to assign weights. Empirical studies with weights capturing travel time and GHG emissions on two road networks (Skagen, Denmark, and North Jutland, Denmark) offer insight into the design properties of the proposed techniques and offer evidence that the techniques are effective.

Index Terms—Spatial databases and GIS, Correlation and regression analysis.


1 INTRODUCTION

Reduction in greenhouse gas (GHG) emissions is crucial in combating global climate change. For example, the EU has committed to reduce GHG emissions to 20% below 1990 levels by 2020 [1]. The transportation sector must contribute to these reductions: in the EU, emissions from transportation account for nearly a quarter of the total GHG emissions [2], making transportation the second largest GHG emitting sector, trailing only the energy sector.

While improved vehicle and engine design is likely to yield GHG emission reductions, eco-routing is readily deployable and is a simple yet effective approach to reducing GHG emissions from road transportation [3]. Specifically, eco-routing can effectively reduce fuel usage and CO2 emissions. Studies suggest that by providing eco-routes to drivers, approximately 8–20% in fuel savings and lower CO2 emissions are possible in different settings, e.g., during peak versus off-peak hours, on highways versus arterial roads, for light versus heavy duty vehicles [4], [5]. For example, a municipal solid waste collection scenario, where a truck collects solid waste from several locations on Santiago Island, demonstrates a 12% fuel reduction due to eco-routes [6].

• B. Yang, M. Kaul and C.S. Jensen are with the Department of Computer Science, Aarhus University, Aarhus DK-8200, Denmark. E-mail: {byang, mkaul, csj}@cs.au.dk.

Vehicle routing relies on a weighted-graph representation of the underlying road network. To achieve effective eco-routing, it is essential that accurate edge weights that capture environmental costs, e.g., fuel consumption or GHG emissions, associated with traversing the edges are available. Given a graph with appropriate weights, eco-routes can be efficiently computed by existing routing algorithms, e.g., based on Dijkstra’s algorithm or the A* algorithm. However, accurate weights that capture environmental impact are not always readily available for a road network. This paper addresses the task of obtaining such weights for a road network from a collection of measured (trip, cost) pairs, where the cost can be any cost associated with a trip, e.g., GHG emissions, fuel consumption, or travel time.

Because the trips given in the input collection of pairs generally do not cover all edges of the road network and also do not cover all times of the day, data sparsity is a key problem. The cost of a trip, e.g., GHG emissions, differs during peak versus off-peak hours. Thus, it is inappropriate to use costs associated with peak-hour trips for obtaining edge weights to be used for eco-routing during off-peak hours.

Considering the road network and trips shown in Fig. 1, assume that the GHG emissions of trip 1 (traversed from 7:30 to 7:33) and trip 2 (traversed from 23:15 to 23:17) are also given, and assume that we are interested in assigning GHG emission weights to all edges in the network. The assignment of these weights to a large number of edges, e.g., BC, BD, EG, and FG, cannot be done directly since they are not covered by any trip. However, for example, BD can be annotated by considering its neighbor road segment AB, which is covered by trip 2.

[Figure: road network with vertices A through J. Legend: Road; Trip 1 (7:30 to 7:33); Trip 2 (23:15 to 23:17). Peak: [6:00, 8:00]; Off-Peak: [0:00, 6:00), (8:00, 24:00).]

Fig. 1. Trips on a Road Network

Assuming that the period from 6:00 to 8:00 is the sole peak-hour period (the remaining times being off-peak), trip 1 is not useful for assigning an off-peak weight to the edge AE because trip 1 traversed AE during peak hours. By taking into account the off-peak weights of IA and AB (covered by trip 2), it is, however, possible to obtain an off-peak weight for AE.

This paper proposes general techniques that take as input (i) a collection of (trip, cost) pairs, where trip captures the edges used and the times when the edges are traversed and the cost represents the cost of the entire trip; and (ii) an unweighted graph model of the road network in which the trips occurred. The techniques then assign travel cost based weights to all edges in the graph.

To the best of our knowledge, this paper is the first to study complete weight annotation of road networks using incomplete information. In particular, the paper makes four contributions. First, a novel problem, road network weight annotation, is proposed and formalized. Second, a general framework for assigning time-varying trip cost based weights to the edges of the road network is presented, along with supportive models, including a directed, weighted graph model capable of capturing time-varying edge weights and a trip cost model based on time-varying edge weights. Third, two novel and judiciously designed objective functions are proposed to contend with the data sparsity. A weighted PageRank-based objective function aims to measure the variance of weights on road segments with similar traffic flows, and a second objective function aims to measure the weight difference on road segments that are directionally adjacent. Fourth, comprehensive empirical evaluations with real data sets are conducted to elicit pertinent design properties of the proposed framework.

The remainder of this paper is organized as follows. Following a survey of related work in Section 2, Section 3 covers problem definition and a general framework for solving the problem. Section 4 details the objective functions. Section 5 reports the empirical evaluation, and Section 6 concludes and discusses research directions.

2 RELATED WORK

Little work has been done on weight annotation of road networks. Trip cost estimation is a core component of our weight annotation solution. Given a set of (trip, cost) pairs as input, trip cost estimation aims to estimate the costs for trips that do not exist in the given input set. Weight annotation can be regarded as a generalized version of trip cost estimation, since if pertinent weights can be assigned to a road network, the cost of any trip on the road network can be estimated. For example, if a GHG emissions based weighted graph is available, the GHG emissions of a certain trip can be estimated as the sum of the weights of the road segments that the trip traverses.

Most existing work on trip cost estimation [7], [8], [9], [10] focuses on travel-time estimation, i.e., it treats travel time as the trip cost. In general, the methods for estimating the travel times of trips can be classified into two categories: (i) segment models and (ii) trip models.

Segment models [9], [10], [11], [12] concern travel time estimation for individual road segments. For example, observers (e.g., Bluetooth sensors or loop detectors deployed along road segments) monitor the traffic on road segments, recording the flows of vehicles along the road segments. Thus, travel-time estimation tends to concern particular road segments. For example, some studies model travel time on a particular road segment as a time series and apply autoregressive models [9] to estimate the travel time on the road segment. T-Drive [10] models time-dependent travel time distributions on road segments using sets of histograms and enables the inference of future travel times using Markov chains [13]. One study incorporates Lagrangian measurements [12] into existing traffic flow models for freeways to estimate travel time distributions on specific freeways.

Segment models assume “hot” road segments for which substantial data is available. However, in practical settings, many road segments lack sufficient historical data, e.g., due to the limited deployment of costly sensors. Segment models are not well suited for the weight annotation problem because the given (trip, cost) pairs typically fail to cover the whole network, meaning that many road segments lack the data needed to apply such models.

The trip models focus on estimating the costs of individual trips. Specifically, the costs of trips are considered more interesting than the costs of individual road segments. Given a collection of trips and their corresponding travel times, one study [8] proposes a Gaussian process regression based method to predict the travel times for unseen trips. However, the study has the limitation that all the trips are required to share the same source and target. This limitation renders the study of limited interest to us, since we aim at annotating every edge with a pertinent weight.




Trajectory regression [7] was proposed recently to infer the travel times of arbitrary trips. The method is able to estimate the travel times of trips consisting of road segments with no or little traversal history by considering the travel time correlation of spatially adjacent road segments.

Trajectory regression is the method most closely related to our weight annotation problem. However, our study distinguishes itself with several unique characteristics. First, we propose a general framework for annotating edges in a road network with a range of trip cost based weights and are not constrained to travel time. Second, we identify the cost correlation of road segments sharing similar traffic flows, and we quantify this by using weighted PageRank values. Third, we consider the temporal cost correlation of adjacent road segments. For example, although two road segments AB and BC are adjacent, the cost of traversing AB during peak hours is not necessarily correlated to the cost of traversing BC during off-peak hours. Fourth, we take into account the directionality of road segments and consider only directional adjacency when determining the cost correlation of spatially adjacent road segments. Last but not least, we conduct comprehensive experiments on real data sets (real trips and real road networks) to demonstrate the effectiveness of annotating road networks with both travel time based weights and GHG emissions based weights. The earlier study on trajectory regression [7] considers only synthetic data and estimates only travel times of trips.

In the intelligent transportation system research field [3], [14], [15], other travel costs (besides travel time) of trips are studied. For example, fuel consumption and GHG emissions of a trip can be computed based on instantaneous vehicle velocities and accelerations, the slopes of the road segments traversed, and the engine type. However, these methods are designed to estimate the costs of individual trips and are not readily applicable to the problem of annotating graph edges with trip cost based weights, notably edges that are not traversed by any trip.

3 PRELIMINARIES

We cover the modeling that underlies the proposed framework, and we provide an overview of the framework and its setting.

We use blackboard bold upper case letters for sets, e.g., $\mathbb{E}$, bold lower case letters for vectors, e.g., $\mathbf{d}$, and bold upper case letters for matrices, e.g., $\mathbf{M}$. Unless stated otherwise, the vectors used are column vectors. The $i$-th element of vector $\mathbf{d}$ is denoted as $\mathbf{d}[i]$, and the element in the $i$-th row and $j$-th column of matrix $\mathbf{M}$ is denoted as $\mathbf{M}[i, j]$. Matrix $\mathbf{M}^T$ is $\mathbf{M}$ transposed. An overview of key notation used in the paper is provided in Table 1.

Notation                       Description
$G$, $G'$                      The primal graph and the dual graph.
$G'_k$                         The dual graph in traffic category tag $\mathit{tag}_k$.
$\mathbb{V}$, $\mathbb{E}$     The vertex set and the edge set.
$\mathbb{V}'$, $\mathbb{E}'$   The dual vertex set and the dual edge set.
$\mathbf{d}$                   The cost variable vector for all edges.
$\mathit{PR}_k(v'_i)$          The weighted PageRank value of dual vertex $v'_i$ in traffic category tag $\mathit{tag}_k$.

TABLE 1. Key Notation

3.1 Modeling a Temporal Road Network

A road network is modeled as a directed, weighted graph $G = (\mathbb{V}, \mathbb{E}, L, F, H)$, where $\mathbb{V}$ and $\mathbb{E}$ are the vertex and edge sets, respectively; $L$ is a function that records the lengths of edges; $F$ is a function that maps times to traffic categories; and $H$ is a function that assigns time-varying weights to edges. We proceed to cover each component in more detail.

A vertex $v_i \in \mathbb{V}$ represents a road intersection or an end of a road. An edge $e_k \in \mathbb{E} \subseteq \mathbb{V} \times \mathbb{V}$ is defined by a pair of vertices and represents a directed road segment that connects the (intersections represented by) two vertices. For example, edge $(v_i, v_j)$ represents a road segment that enables travel from vertex $v_i$ to vertex $v_j$. For convenience, we call this graph representation of a road network the primal graph.

Fig. 2 captures the upper right part of the road network shown in Fig. 1 in more detail. Here, Avenue 1 and Avenue 2 are bidirectional roads, and Street 3 is a one-way road that only allows travel from vertex B to vertex D.

The corresponding primal graph is shown in Fig. 3. In order to capture the bidirectional Avenue 1, two edges (A,B) and (B,A) are generated. Since Street 3 is a one-way road, only one edge, (B,D), is created.

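[Figure: road network with intersections A, B, C, and D and the roads Avenue 1, Avenue 2, and Street 3.]

Fig. 2. Road Network

[Figure: primal graph with vertices A, B, C, and D.]

Fig. 3. Primal Graph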

It is essential to model a road network as a directed graph because the cost associated with traveling in two different directions may differ very substantially. For example, traveling uphill is likely to have a higher fuel cost than traveling downhill. As another example, the congestion may also vary greatly for the two directions of a road.

Function $L : \mathbb{E} \rightarrow \mathbb{R}$ takes as input an edge and outputs the length of the road segment that the edge represents. If road segment AB is 135 meters long, we have $G.L((A,B)) = G.L((B,A)) = 135$.





Next, the cost of traversing the same edge may differ across time. This is typically due to varying degrees of congestion. Thus, GHG emissions or fuel consumption are likely to differ during peak versus off-peak times. To this end, function $F : \mathit{TD} \rightarrow \mathit{TAGS}$ models the varying traffic intensity during different periods. Specifically, $F$ partitions time $\mathit{TD}$ and assigns a traffic category tag in $\mathit{TAGS}$ to each partition. The granularity of the tags is chosen so that the traffic intensity can be assumed to be constant during the time associated with the same tag. For example, F([0:00, 7:00)) = OFFPEAK, F([7:00, 9:00)) = PEAK, F([9:00, 17:00)) = OFFPEAK, etc.
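As an illustration of function F, the following minimal Python sketch looks up the traffic category tag of a time point. The names PARTITION and traffic_tag are hypothetical. Because the example above leaves the remainder of the day unspecified ("etc."), the sketch covers only the three intervals that are given and raises an error for any other time.

```python
from datetime import time

# Half-open intervals [start, end) with their tags, copied from the
# example in the text. Later parts of the day are not specified there,
# so they are deliberately left out.
PARTITION = [
    (time(0, 0), time(7, 0), "OFFPEAK"),
    (time(7, 0), time(9, 0), "PEAK"),
    (time(9, 0), time(17, 0), "OFFPEAK"),
]


def traffic_tag(t):
    """Return the tag of the interval [start, end) that contains t."""
    for start, end, tag in PARTITION:
        if start <= t < end:
            return tag
    raise ValueError(f"no traffic tag defined for {t}")
```

The half-open intervals ensure that a boundary time such as 7:00 belongs to exactly one tag.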

Finally, function $H : \mathbb{E} \times \mathit{TAGS} \rightarrow \mathbb{R}$ assigns time-dependent weights to all edges. In particular, $H$ takes as input an edge and a traffic tag, and outputs the weight for the edge during the traffic tag.

Specifically, $G.H(e_i, \mathit{tag}_j) = d(e_i, \mathit{tag}_j) \cdot G.L(e_i)$, where $d(e_i, \mathit{tag}_j)$ indicates the cost per unit length of traversing edge $e_i$ during tag $\mathit{tag}_j$ and $G.L(e_i)$ is the length of edge $e_i$. To maintain the different costs on different edges during different traffic tags, function $H$ maintains $|\mathbb{E}| \cdot |\mathit{TAGS}|$ cost variables, denoted as $d(e_i, \mathit{tag}_j)$ (where $1 \le i \le |\mathbb{E}|$ and $1 \le j \le |\mathit{TAGS}|$).

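We organize all the cost variables into a cost vector $\mathbf{d} \in \mathbb{R}^{|\mathbb{E}| \cdot |\mathit{TAGS}|}$, where $\mathbf{d} = [d(e_1, \mathit{tag}_1), \ldots, d(e_{|\mathbb{E}|}, \mathit{tag}_1), d(e_1, \mathit{tag}_2), \ldots, d(e_{|\mathbb{E}|}, \mathit{tag}_2), \ldots, d(e_1, \mathit{tag}_{|\mathit{TAGS}|}), \ldots, d(e_{|\mathbb{E}|}, \mathit{tag}_{|\mathit{TAGS}|})]^T$. The $x$-th element of the vector, i.e., $\mathbf{d}[x]$, equals $d(e_i, \mathit{tag}_j)$, where $x = \mathit{pos}(i, j) = (j - 1) \cdot |\mathbb{E}| + i$, since each tag contributes a block of $|\mathbb{E}|$ consecutive entries. Note that if the cost vector $\mathbf{d}$ becomes available, the function $G.H$ also becomes available.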
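The layout of the cost vector can be sketched in a few lines of Python. The sketch follows the order in which the entries of d are listed above (all edges for the first tag, then all edges for the second tag, and so on), so the stride between consecutive tags is the number of edges. The function names and the dictionary-based input are illustrative assumptions.

```python
def pos(i, j, num_edges):
    """1-based position of d(e_i, tag_j) in the cost vector d.

    Entries are grouped by tag, so tag_j starts after (j - 1) blocks
    of num_edges entries each.
    """
    return (j - 1) * num_edges + i


def build_cost_vector(cost, num_edges, num_tags):
    """Arrange per-unit-length costs cost[(i, j)] into the vector d."""
    d = [0.0] * (num_edges * num_tags)
    for j in range(1, num_tags + 1):
        for i in range(1, num_edges + 1):
            d[pos(i, j, num_edges) - 1] = cost[(i, j)]  # lists are 0-based
    return d
```

With 3 edges and 2 tags, pos(1, 2, 3) evaluates to 4, i.e., the first entry of the second tag's block.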

The proposed model is attractive in our setting. It is simpler than existing models capable of capturing time-varying weights (e.g., time-expanded graphs [16] and time-aggregated graphs [17]), and yet it is sufficiently expressive for the problem we solve.

3.2 Trips and Trip Costs

Since vehicle tracking using GPS is widespread and growing, we take into account trips derived from GPS observations. A GPS trajectory $\mathit{gpsTr} = (\mathit{gps}_1, \mathit{gps}_2, \ldots, \mathit{gps}_n)$ is a sequence of GPS observations, where a GPS observation $\mathit{gps}_i$ specifies the location of a vehicle at a particular time point. After map matching and some pre-processing, a GPS trajectory is transformed into a trip $t = (l_1, l_2, \ldots, l_m)$ that consists of a sequence of link records $l_i$ of the form:

link record $l_i$ : $(e, t_s, t_e)$,

where $e \in \mathbb{E}$ indicates an edge in $G$ and $t_s$ and $t_e$ indicate the time points of the first and last GPS observations on edge $e$.

If a graph $G$ is available that contains relevant edge costs, the cost of a trip $t = (l_1, l_2, \ldots, l_m)$ can be estimated by Equation 1.

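$$\mathit{cost}(t) = \sum_{l_i \in t} \sum_{\mathit{tag}_j \in \mathit{TAGS}} \mathit{weight}(l_i, \mathit{tag}_j) \cdot G.H(l_i.e, \mathit{tag}_j), \tag{1}$$

where

$$\mathit{weight}(l_i, \mathit{tag}_j) = \sum_{I \in G.F^{-1}(\mathit{tag}_j)} \frac{|I \cap [l_i.t_s, l_i.t_e]|}{|[l_i.t_s, l_i.t_e]|}.$$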

Here, $G.F^{-1}$ indicates the inverse function of $F$ defined in $G$, which takes as input a traffic tag and outputs the set of its corresponding time intervals. Next, $|\cdot|$ denotes the length of an interval. For example, given a trip that contains link record $l_i = (e_j, 6{:}50, 7{:}05)$ and the traffic tags given in Section 3.1, 10 of the 15 minutes fall into OFFPEAK and 5 into PEAK, so the cost contribution of this link record is $\frac{10}{15} \cdot G.H(e_j, \text{OFFPEAK}) + \frac{5}{15} \cdot G.H(e_j, \text{PEAK}) = \frac{10}{15} \cdot d(e_j, \text{OFFPEAK}) \cdot G.L(e_j) + \frac{5}{15} \cdot d(e_j, \text{PEAK}) \cdot G.L(e_j)$.
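The trip cost model of Equation 1 can be sketched in Python as follows. Times are represented as minutes since midnight, and F_INVERSE holds only the intervals given in the example of Section 3.1; both choices, as well as all names and the numeric cost values, are assumptions made for illustration. The sketch also assumes that every link record has a positive duration.

```python
# Inverse of F: each tag maps to its half-open intervals [start, end),
# in minutes since midnight. Only the intervals from the Section 3.1
# example are included.
F_INVERSE = {
    "OFFPEAK": [(0, 420), (540, 1020)],  # [0:00, 7:00) and [9:00, 17:00)
    "PEAK": [(420, 540)],                # [7:00, 9:00)
}


def overlap(a, b):
    """Length of the intersection of intervals a and b (0 if disjoint)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def weight(link, tag):
    """Fraction of the link's traversal time that falls into the tag."""
    _, ts, te = link
    return sum(overlap(i, (ts, te)) for i in F_INVERSE[tag]) / (te - ts)


def trip_cost(trip, d, length):
    """Equation 1, with G.H(e, tag) expanded to d(e, tag) * G.L(e)."""
    return sum(
        weight(link, tag) * d[(link[0], tag)] * length[link[0]]
        for link in trip
        for tag in F_INVERSE
    )


# A link record spanning 6:50 to 7:05; the cost values are made up.
link = ("ej", 6 * 60 + 50, 7 * 60 + 5)
d = {("ej", "OFFPEAK"): 0.3, ("ej", "PEAK"): 0.6}
length = {"ej": 135.0}
print(weight(link, "OFFPEAK"), weight(link, "PEAK"))
print(trip_cost([link], d, length))
```

For this link record, the two weights are the fractions 10/15 and 5/15 of the traversal time, matching the split used in the worked example.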

3.3 Framework Overview

Fig. 4 gives an overview of the framework for assigning trip cost based weights to a road network. Various types of raw data collected from a road network, such as GPS observations with corresponding CAN bus data and sensor data, are fed into a pre-processing module. While the GPS observations are obligatory, the CAN bus and sensor data are optional.

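[Figure: GPS observations (obligatory), CAN bus data (optional), and sensor data (optional) enter the Pre-Processing Module, which outputs a set of (trip, cost) pairs $\{(t^{(i)}, c^{(i)})\}$. The pairs and the unweighted graph $G''(V, E, L, F, \text{null})$ enter the Weight Annotation Module, which outputs the weighted graph $G(V, E, L, F, H)$.]

Fig. 4. Framework Overview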

Pre-processing module: The GPS observations are map matched and transformed into trips as defined in Section 3.2. Next, a cost is associated with each trip. If only GPS observations are available, some costs, e.g., travel time, can be associated with trips directly. Other costs, e.g., GHG emissions, can be derived. For example, models are available in the literature that are able to provide an estimate of a trip’s GHG emissions and fuel consumption based on the GPS observations of the trip [3]. If CAN bus data and sensor data are also available along with the GPS data, actual and more accurate fuel consumption and GHG emissions can be obtained directly, and thus can be associated with trips.

The pre-processing module outputs a set of (trip, cost) pairs $\{(t^{(i)}, c^{(i)})\}$, which then serve as input to the weight annotation module. For example, if the goal is to assign GHG emissions based weights, cost value $c^{(i)}$ indicates the GHG emissions of trip $t^{(i)}$. Note that the cost $c^{(i)}$ is the total cost associated with the $i$-th trip, meaning that the cost for each individual link record in the $i$-th trip is not required to be known. This makes it easier to collect (trip, cost) pairs. Because pairs may be obtained in a wide variety of ways, the proposed framework has the potential for wide applicability.

Weight annotation module: The (trip, cost) pairs along with a corresponding unweighted graph $G'' = (\mathbb{V}, \mathbb{E}, L, F, \text{null})$ are fed into the weight annotation module. This module assigns pertinent weights to the edges of the graph, and it outputs a weighted graph $G = (\mathbb{V}, \mathbb{E}, L, F, H)$.

Recall that function $G.H$ from Section 3.1 is defined by the cost vector $\mathbf{d}$. Given a set of (trip, cost) pairs $\mathit{TC} = \{(t^{(i)}, c^{(i)})\}$, the core task of this module is to estimate appropriate cost variables in vector $\mathbf{d}$. We formulate the weight annotation problem as a supervised learning problem, namely a regression problem [18] that employs $\mathit{TC}$ as the training data set to estimate cost variables in vector $\mathbf{d}$.

The regression problem is solved by minimizing a judiciously designed objective function composed of three sub-objective items. The first item measures the misfit between the given actual cost and the estimated cost (i.e., the cost obtained from the cost model described in Equation 1) for every trip in TC. The second item measures the differences between the cost variables of two edges whose expected traffic flows (based on topological structures) are similar. The third item measures the differences between the cost variables of two edges which are directionally adjacent. Further, other appropriate metrics that can quantify the difference between the cost variables of two edges can also be incorporated into the module. Finally, minimizing the objective function is handled by solving a system of linear equations.

4 OBJECTIVE FUNCTIONS

Since we regard the problem as a regression problem, we elaborate on the design of the proposed objective function and the solution to minimizing the objective function.

4.1 Residual Sum of Squares

In order to obtain an appropriate estimation of the cost vector $\mathbf{d}$, we need to make sure that for every (trip, cost) pair $(t^{(i)}, c^{(i)}) \in \mathit{TC}$, the misfit between the actual cost (i.e., $c^{(i)}$) and the estimated cost (i.e., $\mathit{cost}(t^{(i)})$ evaluated by Equation 1, which employs $\mathbf{d}$) is as small as possible. To quantify the misfit, the residual sum of squares (RSS) function is applied, where

$$\mathit{RSS}(\mathbf{d}) = \sum_{(t^{(i)}, c^{(i)}) \in \mathit{TC}} \left(c^{(i)} - \mathit{cost}(t^{(i)})\right)^2.$$

To facilitate the following discussion, we derive a matrix representation of the RSS function, as shown in Equation 2.

$$\mathit{RSS}(\mathbf{d}) = \|\mathbf{c} - \mathbf{Q}^T \mathbf{d}\|_2^2 \tag{2}$$

Let the cardinality of the set $\mathit{TC}$ be $N$ (i.e., $|\mathit{TC}| = N$). We define a vector $\mathbf{c} = [c^{(1)}, c^{(2)}, \ldots, c^{(N)}]^T \in \mathbb{R}^N$, where $c^{(i)}$ is the given actual cost of the trip $t^{(i)}$, and $(t^{(i)}, c^{(i)}) \in \mathit{TC}$. A matrix $\mathbf{Q} = [\mathbf{q}^{(1)}, \mathbf{q}^{(2)}, \ldots, \mathbf{q}^{(N)}] \in \mathbb{R}^{|\mathbf{d}| \times N}$ is introduced to enable us to rephrase Equation 1 into a matrix representation. Specifically, $\mathbf{q}^{(k)}$ is the $k$-th column vector in $\mathbf{Q}$, which corresponds to trip $t^{(k)}$. If trip $t^{(k)}$ contains a link record $l$ whose corresponding edge is $e_i$ (i.e., $l.e = e_i$), then $\mathbf{q}^{(k)}[\mathit{pos}(i, j)] = G.L(e_i) \cdot \mathit{weight}(l, \mathit{tag}_j)$ for $1 \le j \le |\mathit{TAGS}|$; otherwise, it is set to 0.
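To make the matrix form concrete, the following pure-Python sketch builds a column q(k) for one trip and evaluates the RSS of Equation 2. It rests on several assumptions: edges and tags are identified by 1-based indices, each link record is given as an edge index together with its precomputed per-tag weights, and pos uses a stride equal to the number of edges, following the order in which Section 3.1 lists the entries of d. All names and numbers are illustrative.

```python
def pos(i, j, num_edges):
    """1-based position of d(e_i, tag_j); entries are grouped by tag."""
    return (j - 1) * num_edges + i


def build_column(trip, num_edges, num_tags, length):
    """Column q(k) of Q for one trip.

    trip is a list of (edge index, {tag index: weight}) pairs. Entries
    are accumulated so that a trip traversing an edge more than once
    stays consistent with the sum over link records in Equation 1.
    """
    q = [0.0] * (num_edges * num_tags)
    for i, tag_weights in trip:
        for j, w in tag_weights.items():
            q[pos(i, j, num_edges) - 1] += length[i] * w
    return q


def rss(costs, columns, d):
    """||c - Q^T d||_2^2, with Q given as a list of its columns."""
    total = 0.0
    for c_k, q_k in zip(costs, columns):
        estimate = sum(q * x for q, x in zip(q_k, d))  # q(k)^T d
        total += (c_k - estimate) ** 2
    return total


# Two edges, two tags; all numbers are made up.
length = {1: 100.0, 2: 50.0}
trip = [(1, {1: 1.0}), (2, {1: 0.5, 2: 0.5})]
q = build_column(trip, 2, 2, length)
d = [0.1, 0.2, 0.3, 0.4]
print(q, rss([27.0], [q], d))
```

The entry of q for edge 1 under tag 2 stays 0 because the trip never traverses edge 1 during that tag; cost variables whose entries are 0 in every column are exactly those that minimizing the RSS alone cannot determine.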

Different from ordinary regression problems, minimizing Equation 2 is insufficient for determining every cost variable in $\mathbf{d}$ because the trips in TC may not cover all the edges in the road network, i.e., all the edges in $\mathbb{E}$. For the edges that are never traversed by any trip in TC, their corresponding cost variables in $\mathbf{d}$ cannot be determined by only minimizing the RSS function.

In this case, annotating the edges that do not appear in TC with weights seems to be difficult and even unsolvable. In the following, we use the topology of the road network to propagate and constrain the cost variables in order to assign an appropriate weight to every edge.

4.2 Topological Constraint

The topology of a road network is highly correlated with human movement flow [19], [20], including the movement of both pedestrians and vehicles. Edges with similar movement flows can be expected to have similar cost variables. Thus, if an edge is covered in TC, its cost variable information can be propagated to the edges that have similar movement flows. To this end, we study how to quantify movement flow based similarity between edges using topological information of road networks.

4.2.1 Modeling Traffic Flows with PageRank

We transfer the idea of using PageRank for the modeling of web surfers to the modeling of vehicle movement in road networks. The original PageRank employs the hyperlink structure of the web to build a first-order Markov chain, where each web page corresponds to a state [21]. The Markov chain is governed by a transition probability matrix $\mathbf{M}$. If web page $i$ has a hyperlink pointing to web page $j$, then $\mathbf{M}[i, j]$ is set to $\frac{1}{\mathit{outDegree}(i)}$; otherwise, it is set to 0. $\mathbf{M}[i, j]$ indicates the probability of transition from state $i$ to state $j$. PageRank models a user browsing the web as a Markov process based on matrix $\mathbf{M}$, and the final PageRank vector is the stationary distribution vector $\mathbf{x}$ of matrix $\mathbf{M}$. The PageRank of web page $i$, i.e., $\mathbf{x}[i]$, indicates the probability that the user visits page $i$ or, equivalently, the fraction of time the user spends on page $i$ in the long run [21].
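The web-surfer model just described can be sketched as follows: the transition matrix is built from out-links, and the stationary distribution is approximated by repeatedly applying the matrix to an initial uniform distribution (power iteration). The three-page link structure is a made-up example, and the sketch assumes that every page has at least one out-link and that the chain is irreducible and aperiodic, so that the iteration converges.

```python
def transition_matrix(links, n):
    """M[i][j] = 1 / outDegree(i) if page i links to page j, else 0."""
    m = [[0.0] * n for _ in range(n)]
    for i, targets in links.items():
        for j in targets:
            m[i][j] = 1.0 / len(targets)
    return m


def stationary(m, iterations=200):
    """Approximate the stationary distribution x with x = x M."""
    n = len(m)
    x = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(iterations):
        x = [sum(x[i] * m[i][j] for i in range(n)) for j in range(n)]
    return x


# Hypothetical web graph: page 0 links to 1 and 2, 1 links to 2, 2 links to 0.
links = {0: [1, 2], 1: [2], 2: [0]}
x = stationary(transition_matrix(links, 3))
print(x)
```

In this toy graph, pages 0 and 2 end up with equal scores and page 1 with half of that, since page 1 receives only half of page 0's score.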

Modeling the movements of vehicles on a road network as stochastic processes is well studied in the transportation field [22]. In particular, the modeling of vehicle movements as Markov processes is an easy-to-use and effective approach [20]. Thus, we build a first-order Markov chain with a transition probability matrix derived from both the topology of the road network and the trips that occur in the road network. A state corresponds to an edge in the primal graph (i.e., a directed road segment), not a vertex (i.e., a road intersection).

The PageRank value of a state indicates the probability that a vehicle travels on the edge or, equivalently, the fraction of time a vehicle spends on the edge in the long run. Thus, the PageRank value is expected to reflect the traffic flow on the edge. Further, a series of topological metrics [19], including centrality-based metrics, small-world metrics, space-syntax metrics, and PageRank metrics, have been applied to capture human movement flows in urban environments. When using a graph representation of an urban environment, it is found that the classical and weighted PageRank metrics are highly correlated with human movements [19], [23]. Thus, if two edges have similar PageRank values, the traffic flow on the two segments should be similar.

When modeling web surfers, PageRank assumes that the Markov chain is time-homogeneous, meaning that the probability of transferring from page $i$ to page $j$ has the same fixed value at all times. In other words, matrix $\mathbf{M}$ is static across time. In contrast, the time-homogeneous assumption does not hold for vehicles traveling in road networks. For example, during peak hours, the transition probability from edge $i$ to edge $j$ may be substantially different from the probability during off-peak hours. Thus, we maintain a distinct transition probability matrix $\mathbf{M}_k$ for each traffic category tag $\mathit{tag}_k$. During a particular traffic tag, we assume the Markov chain to be time-homogeneous.

4.2.2 PageRank on Dual Graphs

PageRank was originally proposed to assign prestige to web pages in a web graph, where web pages are modeled as vertices and the hyperlinks between web pages are modeled as edges. Unlike the web graph, we are not interested in the prestige of vertices (i.e., road intersections) in the primal graph representation of a road network; rather, we are interested in the prestige of edges (i.e., directed road segments).

In order to assign PageRank values to edges, the primal graph $G = (\mathbb{V}, \mathbb{E}, L, F, H)$ is transformed into a dual graph $G' = (\mathbb{V}', \mathbb{E}')$, where each vertex in $\mathbb{V}'$ corresponds to an edge in the primal graph, and where each edge in $\mathbb{E}'$, denoted by a pair of vertices in $\mathbb{V}'$, corresponds to a vertex in the primal graph. Since functions $L$, $F$, and $H$ are not of interest in this section, we do not keep them in the dual graph.

To avoid ambiguity, we use the terms edge and vertex when referring to primal graphs and use dual edge and dual vertex when referring to dual graphs. Further, we use the term weight when referring to the weight of an edge in a primal graph, and we use dual weight in the context of dual edges in a dual graph.

We define a mapping $\mathit{D2P} : \mathbb{V}' \cup \mathbb{E}' \rightarrow \mathbb{V} \cup \mathbb{E}$ to record the correspondence between the elements in the dual and primal graphs. Fig. 5 shows the dual graph that corresponds to the primal graph shown in Fig. 3. Since the dual vertex AB corresponds to the edge (A,B) in Fig. 3, D2P(AB) = (A,B). Similarly, since the dual edge (CB,BA) corresponds to the vertex B in Fig. 3, D2P((CB,BA)) = B.

[Figure: dual graph with dual vertices AB, BA, BC, CB, and BD.]

Fig. 5. Dual Graph

The dual graph is able to model an important characteristic of a road network: at a particular intersection, the probability of which segment a vehicle follows depends on the segment via which the vehicle entered the intersection. Considering the road network shown in Fig. 2, at intersection (i.e., vertex) B, a vehicle can proceed to follow segments (i.e., edges) (B,A), (B,C), or (B,D). If a vehicle entered the intersection using segment (C,B), it may be unlikely that the vehicle takes a u-turn to follow segment (B,C), while it is more likely that it will use the other segments. Similar cases exist if a vehicle arrived at the intersection using segment (A,B).

Modeling this characteristic in a primal graph is not easy. For example, we need to maintain two sets of probabilities on edge (B,C), for the vehicles that came from edge (C,B) versus edge (A,B). In contrast, modeling this in a dual graph is straightforward, as how a vehicle entered a particular intersection is clearly represented as a dual vertex. For example, the probabilities on dual edges (CB,BC) and (AB,BC) record the probabilities that a vehicle entered intersection B from edge (C,B) and edge (A,B), respectively, and continues along edge (B,C).
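The primal-to-dual transformation and the mapping D2P can be sketched in Python as follows. The sketch assumes that Avenue 2 connects B and C, as the dual vertices BC and CB in Fig. 5 suggest, and it creates a dual edge for every pair of consecutive primal edges, including u-turns such as (CB,BC). The function names are hypothetical.

```python
# Primal edges of Fig. 3. Avenue 1 (A-B) is bidirectional; Avenue 2 is
# taken to be B-C (an assumption based on the dual vertices in Fig. 5);
# Street 3 is one-way from B to D.
PRIMAL_EDGES = [("A", "B"), ("B", "A"), ("B", "C"), ("C", "B"), ("B", "D")]


def build_dual(primal_edges):
    """Return (dual_vertices, dual_edges) for a list of directed edges."""
    dual_vertices = list(primal_edges)  # one dual vertex per primal edge
    dual_edges = [
        (e1, e2)
        for e1 in primal_edges
        for e2 in primal_edges
        if e1[1] == e2[0]  # e1 ends at the vertex where e2 starts
    ]
    return dual_vertices, dual_edges


def d2p(element):
    """Map a dual vertex to its primal edge, or a dual edge to its primal vertex."""
    if isinstance(element[0], tuple):  # dual edge ((u, v), (v, w))
        return element[0][1]           # the shared primal vertex v
    return element                     # dual vertex (u, v) is the primal edge


dual_vertices, dual_edges = build_dual(PRIMAL_EDGES)
print(d2p(("A", "B")), d2p((("C", "B"), ("B", "A"))))
```

Because a dual vertex records both endpoints of a road segment, the two dual edges (CB,BC) and (AB,BC) are distinct objects and can carry different transition probabilities, which is the property the discussion above relies on.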

Given the dual graph $G' = (\mathbb{V}', \mathbb{E}')$, original PageRank values are defined formally as follows.

$$\mathit{PR}(v'_i) = \frac{1 - \mathit{df}}{|\mathbb{V}'|} + \mathit{df} \cdot \sum_{v'_j \in \mathit{IN}(v'_i)} \frac{\mathit{PR}(v'_j)}{|\mathit{OUT}(v'_j)|}, \quad v'_i \in \mathbb{V}', \tag{3}$$

where $\mathit{PR}(v'_i)$ indicates the PageRank value of dual vertex $v'_i$; $\mathit{IN}(v'_i)$ indicates the set of in-link neighbors of $v'_i$, i.e., $\mathit{IN}(v'_i) = \{v'_x \mid (v'_x, v'_i) \in \mathbb{E}'\}$; and $\mathit{OUT}(v'_j)$ indicates the set of out-link neighbors of $v'_j$, i.e., $\mathit{OUT}(v'_j) = \{v'_x \mid (v'_j, v'_x) \in \mathbb{E}'\}$. Further, $\mathit{df} \in [0, 1]$ is a damping factor, which is normally set to 0.85 for ranking a web graph.

The intuition behind Equation 3 is that the PageRank values are composed of two parts: jumping to another random vertex and continuing the random walk. This assumption works fine on the web graph, but we need to adapt it to the different characteristics of the graph representing a road network. In a road network, it is impossible for a vehicle to choose a random edge to traverse when at an intersection. Rather, it can only choose to continue along one of the out-link (dual) edges. Based on this observation, we set the damping factor df to 1. Some existing empirical studies [19] also suggest that with the damping factor set to 1, the resulting PageRank values correlate best with human movement flows.

4.2.3 Weighted PageRank Computation

Definition of Dual Weights: In the original PageRank algorithm, a vertex propagates its PageRank value evenly to all its out-link neighbors. In other words, the dual weight for each dual edge from dual vertex $v'_j$ is set uniformly to $\frac{1}{|OUT(v'_j)|}$. The uniform weights on the web graph indicate that a web surfer chooses the next target web page without any preference when continuing the random surfing. However, in a road network, such preference-free surfing usually does not occur. For example, where a vehicle continues next often depends on where the vehicle came from, as discussed in Section 4.2.2. Also, if Avenue 1 and Avenue 2 are the main roads in the road network shown in Fig. 2, more vehicles travel from AB to BC than from AB to BD. Further, under different traffic category tags, the transitions between dual vertices may also be quite different.

With the availability of very large collections of GPS data, we are able to capture the probability that a vehicle transits from one road segment to another at an intersection under different traffic category tags. Assume we only distinguish between peak and off-peak hours, i.e., there are only two corresponding tags in TAGS. Suppose we obtain the numbers of trips that occurred on dual edges, as shown in Table 2.

Tags       (AB,BC)   (AB,BD)   (AB,BA)
PEAK       30        10        0
OFFPEAK    5         5         0

TABLE 2
Numbers of Trips Occurred on Dual Edges

For example, among all the trips that occurred on dual vertex AB during the peak hours, 30 trips proceeded to follow BC, and 10 trips followed BD; during off-peak hours, 5 trips followed BC, and 5 trips followed BD. These observations suggest that the dual weight on dual edge (AB, BC) should be greater than the dual weight on dual edge (AB, BD) during peak hours, while they should be the same during off-peak hours.

As the dual graph has different dual weights for different traffic tags, we need to maintain a dual graph for each traffic tag. Specifically, the training data set TC is partitioned into $TC_1, TC_2, \ldots, TC_{|TAGS|}$ according to the traversal times. Partition $TC_k$ consists only of the trips that occurred during the time period indicated by the traffic tag $tag_k$, i.e., $G.F^{-1}(tag_k)$.

The dual weight of a dual edge $(v'_i, v'_j)$ during tag $tag_k$ is related to the ratio of the number of trips that traversed the dual vertices $v'_i$ and $v'_j$ to the number of trips that traversed the dual vertex $v'_i$, during tag $tag_k$. Further, to contend with data sparsity, Laplace smoothing is applied to smooth the dual weight values for the dual edges that are not covered by any trip in TC. The dual weight of dual edge $(v'_i, v'_j)$ for the dual graph within $tag_k$ (denoted as $G'_k$) is computed based on Equation 4.

$$W_k(v'_i, v'_j) = \frac{|Trip_k(v'_i, v'_j)| + 1}{\sum_{v'_x \in OUT(v'_i)} |Trip_k(v'_i, v'_x)| + |OUT(v'_i)|}, \tag{4}$$

where $Trip_k(v'_i, v'_j)$ returns the set of trips in partition $TC_k$ that traversed the dual vertices $v'_i$ and $v'_j$.

Continuing the example shown in Table 2, although no trip goes from the dual vertex AB directly back to BA in TC, this does not mean that such a trip will not occur in the future. Thus, we need to give a small, non-zero value to the dual weight of dual edge (AB, BA). Using the dual weights provided by Equation 4, the dual weights of the out-linking dual edges of dual vertex AB are: $W_{PEAK}(AB, BC) = \frac{31}{43}$, $W_{PEAK}(AB, BD) = \frac{11}{43}$, and $W_{PEAK}(AB, BA) = \frac{1}{43}$; and $W_{OFFPEAK}(AB, BC) = \frac{6}{13}$, $W_{OFFPEAK}(AB, BD) = \frac{6}{13}$, and $W_{OFFPEAK}(AB, BA) = \frac{1}{13}$.

Note that for a given dual vertex $v'_i$, if no trips in TC are available to assign the dual weights during a traffic tag $tag_k$, i.e., $|Trip_k(v'_i, v'_x)| = 0$ for every $v'_x \in OUT(v'_i)$, Equation 4 assigns weight $\frac{1}{|OUT(v'_i)|}$ to each dual edge, which is exactly what the original PageRank algorithm does. For instance, if no trips are available for dual vertex AB (i.e., if the numbers in Table 2 are all zeros), the dual weights $W_k(AB, BC)$, $W_k(AB, BD)$, and $W_k(AB, BA)$ are all $\frac{1}{3}$.

Computing Weighted PageRank Values: Based on the dual weights obtained from Equation 4, we construct the transition probability matrices $M_k \in \mathbb{R}^{|V'| \times |V'|}$. Specifically, the element in the $i$th row and $j$th column of $M_k$, i.e., $M_k[i, j]$, equals $W_k(v'_i, v'_j)$ if the dual edge $(v'_i, v'_j)$ exists in the dual graph; otherwise, it equals 0. Note that the sum of all elements in a row equals 1, i.e., $\sum_{j=1}^{|V'|} M_k[i, j] = 1$ for every $1 \le i \le |V'|$.
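The Laplace-smoothed dual weights of Equation 4, which make up the non-zero entries of one row of $M_k$, can be sketched as follows. This is a minimal illustration using the trip counts of Table 2; the function name `dual_weights` and the dictionary layout are assumptions made for the example.

```python
from fractions import Fraction

def dual_weights(trip_counts):
    """Laplace-smoothed weights out of one dual vertex (Equation 4).

    trip_counts maps each out-link neighbor to the number of trips observed
    on the corresponding dual edge under one traffic tag.
    """
    # Denominator: total observed trips plus one pseudo-trip per out-link.
    total = sum(trip_counts.values()) + len(trip_counts)
    return {nbr: Fraction(count + 1, total) for nbr, count in trip_counts.items()}

# Trip counts out of dual vertex AB, taken from Table 2.
peak = dual_weights({"BC": 30, "BD": 10, "BA": 0})     # 31/43, 11/43, 1/43
offpeak = dual_weights({"BC": 5, "BD": 5, "BA": 0})    # 6/13, 6/13, 1/13
no_data = dual_weights({"BC": 0, "BD": 0, "BA": 0})    # uniform: 1/3 each
```

Exact fractions are used here only so that the values can be compared directly with those in the text; the weights out of a dual vertex always sum to 1.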





Let vector $v_k \in \mathbb{R}^{|V'|}$ record the PageRank values for every dual vertex in $G'_k$. Specifically, $v_k[i] = PR_k(v'_i)$, which is the PageRank value of $v'_i$ during traffic category tag $tag_k$. This way, the PageRank values can be computed iteratively as follows until convergence.

$$v_k^{(n+1)} = M_k^T \cdot v_k^{(n)},$$

where $v_k^{(n)}$ is the PageRank vector in the $n$-th iteration.
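The iteration above can be sketched in a few lines of Python. The 3-by-3 transition matrix is a hypothetical example, not data from the paper, and the stopping rule (L1 change below a tolerance) is an assumption. With the damping factor set to 1, this plain iteration is only guaranteed to converge when the chain is irreducible and aperiodic, which holds for the example matrix.

```python
def weighted_pagerank(M, tol=1e-12, max_iter=10_000):
    """Iterate v <- M^T v from the uniform vector until the L1 change is below tol.

    M is a row-stochastic transition matrix given as a list of rows
    (damping factor 1, i.e., no random jumps).
    """
    n = len(M)
    v = [1.0 / n] * n
    for _ in range(max_iter):
        # (M^T v)[j] = sum_i M[i][j] * v[i]
        nxt = [sum(M[i][j] * v[i] for i in range(n)) for j in range(n)]
        if sum(abs(a - b) for a, b in zip(nxt, v)) < tol:
            return nxt
        v = nxt
    raise RuntimeError("power iteration did not converge")

# Hypothetical row-stochastic matrix over three dual vertices.
M = [
    [0.0, 0.5, 0.5],
    [1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0],
]
pagerank = weighted_pagerank(M)
```

For this matrix the iteration settles at the stationary distribution (4/9, 3/9, 2/9), which can be checked by hand from $v = M^T v$.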

4.2.4 PageRank-Based Topological Constraint Objective Function

After obtaining the weighted PageRank values for every dual vertex, the topological similarity between two edges in the primal graph is quantified in Equation 5.

$$S^{PR}_k(e_i, e_j) = \frac{\min(PR_k(v'_{e_i}), PR_k(v'_{e_j}))}{\max(PR_k(v'_{e_i}), PR_k(v'_{e_j}))} \tag{5}$$

The topological similarity between edges $e_i$ and $e_j$, denoted as $S^{PR}_k(e_i, e_j)$, is defined based on the weighted PageRank values of the two dual vertices representing the edges. To be specific, $v'_{e_i}$ and $v'_{e_j}$ indicate the corresponding dual vertices of edges $e_i$ and $e_j$, i.e., $D2P(v'_{e_i}) = e_i$ and $D2P(v'_{e_j}) = e_j$. Note that Equation 5 returns a high similarity if two edges have similar weighted PageRank scores and that it returns a low similarity otherwise.

Based on the topological similarity, a PageRank-based Topological Constraint (PRTC) function is incorporated into the overall objective function. The intuition behind the PRTC function is that for the same traffic category tag, if two edges have similar traffic flows (as measured by Equation 5), their cost variables tend to be similar as well. The PRTC function is defined in Equation 6.

$$PRTC(d) = \sum_{k=1}^{|TAGS|} PRTC(d, k), \tag{6}$$

where

$$PRTC(d, k) = \sum_{i,j=1}^{|G.E|} S^{PR}_k(e_i, e_j) \cdot \left(d_{(e_i, tag_k)} - d_{(e_j, tag_k)}\right)^2.$$

The value of the PRTC function over the cost vector d is the sum of $PRTC(d, k)$ for every $1 \le k \le |TAGS|$. The function $PRTC(d, k)$ computes the weighted (decided by $S^{PR}_k$) sum of the squared differences between each pair of road segments’ cost variables during traffic tag $tag_k$.

The PRTC function has two important features: (i) if the PageRank values of two edges are similar, the similarity value $S^{PR}_k$ is large, thus making the difference between their cost variables weigh heavily in the function; (ii) if two edges’ PageRank values are dissimilar, the small similarity value $S^{PR}_k$ damps the contribution of the difference between their cost variables. This way, minimizing the PRTC function corresponds to minimizing the overall difference between cost variables whose corresponding road segments have similar traffic flows.

To obtain the matrix representation of the PRTC function, we introduce a matrix $A \in \mathbb{R}^{|d| \times |d|}$, which is a block diagonal matrix.

$$A = \begin{pmatrix} A_1 & & & \\ & A_2 & & \\ & & \ddots & \\ & & & A_{|TAGS|} \end{pmatrix} \tag{7}$$

where $A_k \in \mathbb{R}^{|E| \times |E|}$ and $A_k[i, j] = S^{PR}_k(e_i, e_j)$, which obviously is a symmetric matrix. Let matrix $L_A$ be the graph Laplacian induced by the similarity matrix A. Specifically, $L_A[i, j] = \delta_{i,j} \cdot \sum_x A[i, x] - A[i, j]$, where $\delta_{i,j}$ returns 1 if i equals j, and 0 otherwise. The matrix representation of the PRTC function is shown in Equation 8.

$$PRTC(d) = d^T L_A d \tag{8}$$
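The relation between the pairwise form of Equation 6 and the Laplacian form of Equation 8 can be checked numerically for a single tag. The sketch below uses hypothetical PageRank values and cost variables and assumes strictly positive PageRank values. One observation, offered with care: with the Laplacian defined as above, the sum over all ordered pairs (i, j) equals $2 \cdot d^T L_A d$, so Equation 8 matches Equation 6 up to a constant factor of 2, which can be absorbed into the hyper-parameter that scales the PRTC term.

```python
def similarity_matrix(pagerank, threshold=0.0):
    """A_k[i][j] = min(PR_i, PR_j) / max(PR_i, PR_j) (Equation 5).

    Entries below `threshold` are set to 0 (0.0 disables filtering).
    Assumes all PageRank values are strictly positive.
    """
    n = len(pagerank)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            s = min(pagerank[i], pagerank[j]) / max(pagerank[i], pagerank[j])
            A[i][j] = s if s >= threshold else 0.0
    return A

def laplacian(A):
    """L[i][j] = delta_ij * sum_x A[i][x] - A[i][j]."""
    n = len(A)
    return [[(sum(A[i]) if i == j else 0.0) - A[i][j] for j in range(n)]
            for i in range(n)]

def quadratic_form(L, d):
    """d^T L d."""
    n = len(d)
    return sum(d[i] * L[i][j] * d[j] for i in range(n) for j in range(n))

def pairwise_penalty(A, d):
    """Sum over all ordered pairs (i, j) of A[i][j] * (d_i - d_j)^2."""
    n = len(d)
    return sum(A[i][j] * (d[i] - d[j]) ** 2 for i in range(n) for j in range(n))

pagerank = [0.4, 0.3, 0.2, 0.1]   # hypothetical weighted PageRank values
d = [12.0, 10.0, 7.0, 3.0]         # hypothetical cost variables for one tag
A = similarity_matrix(pagerank)
L = laplacian(A)
```

Comparing `pairwise_penalty(A, d)` with `2 * quadratic_form(L, d)` confirms the identity on this example; the `threshold` argument is included only to mirror the kind of entry filtering a sparse implementation might apply.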

4.2.5 Properties of PageRank on Road Networks

Web graphs and road network graphs are quite different, rendering it of interest to study the distributions of PageRank values on the two kinds of graphs. Fig. 6 shows the normalized (to (1, 100]) PageRank values with respect to the percentage of vertices having the PageRank values, on a graph (WEB) representing a part of the Web¹ and a dual graph (NJ) representing the road network of North Jutland, Denmark.

[Figure: two panels with axes "Percentage of Vertices (%)" and "Normalized PageRank Values": (a) The Web (WEB); (b) Road Networks (NJ)]

Fig. 6. PageRank on the Web and a Road Network

Fig. 6 suggests that PageRank values on NJ are distributed more uniformly than for WEB. With this type of distribution, many vertices have the same or very similar high PageRank values, which renders the distribution ineffective for ranking when compared to WEB. However, the distribution is effective for our objective of identifying road segments with similar traffic flows based on PageRank values.

4.3 Adjacency Constraint

The PRTC function is derived from the overall structure of the road network. In this section, we consider a finer-grained topological aspect of the road network, namely, directional adjacency.

An important feature of a road network is that an event at one road segment may propagate to influence adjacent road segments. Consider a typical event in a road network, e.g., traffic congestion. If congestion occurs on road segment (A,B) in Fig. 2, road segment (B,C) may also experience congestion, or at least the traffic on (B,C) is affected by the congestion that occurs on (A,B). Thus, the cost variables of two directionally adjacent road segments should be similar.

1. http://snap.stanford.edu/data/web-Google.html

The directional adjacency we discuss here is represented clearly in the dual graph. If and only if two dual vertices are connected by a dual edge in the dual graph, the two corresponding road segments are directionally adjacent. For example, although edges (B, D) and (B, C) (in Fig. 3) intersect, their cost variables may not necessarily tend to be similar because no vehicle can travel between these two edges. Directional adjacency is distinct from the “non-directional” adjacency considered in previous work [7].

Another point worth noting is that if two road segments represent opposite directions of the same physical road segment, they are not directionally adjacent. It is natural that an event on a physical road only yields congestion in one direction, but not both directions. Considering the edges (A,B) and (B,A) (in Fig. 3), their corresponding vertices in the dual graph (AB and BA in Fig. 5) are connected by two edges; however, their cost variables are not necessarily similar.

Directional adjacency is also temporally sensitive. For example, although edges (A,B) and (B,C) are directionally adjacent, the general traffic situation (indicated by the cost variable) on edge (A,B) during peak hours is not necessarily correlated with the traffic on edge (B,C) during non-peak hours.

To capture directional adjacency, we incorporate a Directionally Adjacent Temporal Constraint (DATC) function into the overall objective function.

$$DATC(d) = \sum_{k=1}^{|TAGS|} DATC(d, k), \tag{9}$$

where

$$DATC(d, k) = \sum_{i,j=1}^{|G.E|} W'_k(v'_{e_i}, v'_{e_j}) \cdot \left(d_{(e_i, tag_k)} - d_{(e_j, tag_k)}\right)^2,$$

and where $v'_{e_i}$ and $v'_{e_j}$ have the same meaning as in Equation 5. $W'_k(v'_{e_i}, v'_{e_j})$ is as defined in Equation 4 if $v'_{e_i}$ and $v'_{e_j}$ do not indicate the same physical road segment; and $W'_k(v'_{e_i}, v'_{e_j})$ equals 0 otherwise. For instance, although $W_{PEAK}(AB, BA) = \frac{1}{43}$ as discussed in Section 4.2.3, $W'_{PEAK}(AB, BA) = 0$ since AB and BA indicate the same physical road segment, Avenue 1.

The DATC function aims to make the cost variables satisfy the following property: given road segments $e_i$ and $e_j$, if many of the trips that follow $e_i$ also follow $e_j$, as indicated by $W'_k(v'_{e_i}, v'_{e_j})$, the cost variables on the two edges tend to be more correlated.

Similar to the discussion in Section 4.2.4, we introduce a block diagonal matrix $B \in \mathbb{R}^{|d| \times |d|}$ with the same format as matrix A (defined in Equation 7). In particular, in each block matrix, $B_k[i, j] = \max(W'_k(v'_{e_i}, v'_{e_j}), W'_k(v'_{e_j}, v'_{e_i}))$, which guarantees that matrix $B_k$, and hence matrix B, are symmetric. Note that it is not possible that both $W'_k(v'_{e_i}, v'_{e_j})$ and $W'_k(v'_{e_j}, v'_{e_i})$ are non-zero because if edge $D2P(v'_{e_i})$ is directionally adjacent to edge $D2P(v'_{e_j})$ then edge $D2P(v'_{e_j})$ cannot be directionally adjacent to edge $D2P(v'_{e_i})$. Let $L_B$ be the graph Laplacian derived from matrix B. The DATC function is represented by Equation 10.

$$DATC(d) = d^T L_B d \tag{10}$$
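The construction of one block $B_k$ can be sketched as follows, using the peak-hour weights out of dual vertex AB from Section 4.2.3. Two points are assumptions of this sketch rather than statements from the paper: edges are represented as (tail, head) tuples, and two edges are treated as the same physical road segment when they share the same pair of endpoints.

```python
def adjacency_weights(edges, W):
    """Build the symmetric matrix B_k for one traffic tag.

    edges: list of primal edges (u, v).
    W: dict mapping (e_i, e_j) to the dual weight of Equation 4.
    """
    def w_prime(ei, ej):
        # W' is 0 for the two directions of one physical segment
        # (assumed here: same endpoint set), and W otherwise.
        if set(ei) == set(ej):
            return 0.0
        return W.get((ei, ej), 0.0)

    n = len(edges)
    # Taking the max over both directions makes B_k symmetric.
    return [[max(w_prime(edges[i], edges[j]), w_prime(edges[j], edges[i]))
             for j in range(n)] for i in range(n)]

edges = [("A", "B"), ("B", "A"), ("B", "C"), ("B", "D")]
W_peak = {
    (("A", "B"), ("B", "C")): 31 / 43,
    (("A", "B"), ("B", "D")): 11 / 43,
    (("A", "B"), ("B", "A")): 1 / 43,
}
B = adjacency_weights(edges, W_peak)
```

In the resulting matrix, the entry for (AB, BA) is 0 even though $W_{PEAK}(AB, BA)$ is non-zero, and the entry for (BC, BD) is 0 because neither edge leads into the other.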

4.4 Solving The Problem

Combining the three individual objective functions and a classical L2 regularizer, we obtain the overall objective function O(d):

$$O(d) = RSS(d) + \alpha \cdot PRTC(d) + \beta \cdot DATC(d) + \gamma \cdot \|d\|_2^2,$$

where $\alpha$, $\beta$, and $\gamma$ are hyper-parameters that control the tradeoff among the losses on RSS, PRTC, DATC, and the L2 regularizer. The matrix representation of the objective function is shown in Equation 11.

$$O(d) = \|c - Q^T d\|_2^2 + \alpha \cdot d^T L_A d + \beta \cdot d^T L_B d + \gamma \cdot \|d\|_2^2 \tag{11}$$

By differentiating Equation 11 w.r.t. vector d and setting it to 0, we get

$$[QQ^T + \alpha \cdot L_A + \beta \cdot L_B + \gamma \cdot I]\, d = Qc. \tag{12}$$

The solution to Equation 12 is the optimal solution to the cost vector, denoted as $\hat{d}$, that minimizes the overall objective function in Equation 11. The linear system in Equation 12 can be solved efficiently by several iterative algorithms such as the conjugate gradient algorithm [24].
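As an illustration of how Equation 12 might be assembled and solved, the following sketch implements a textbook conjugate gradient loop in plain Python. All inputs are hypothetical: a 0/1 matrix Q over three cost variables and two trips, two small Laplacians, and arbitrary hyper-parameter values. It is not the authors' implementation, which the paper states is written in Java using MTJ.

```python
def mat_vec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def conjugate_gradient(A, b, tol=1e-12, max_iter=1000):
    """Solve A x = b for a symmetric positive definite matrix A."""
    x = [0.0] * len(b)
    r = list(b)              # residual b - A x for the start vector x = 0
    p = list(r)              # first search direction
    rs = dot(r, r)
    for _ in range(max_iter):
        if rs < tol:         # squared residual norm small enough
            break
        Ap = mat_vec(A, p)
        step = rs / dot(p, Ap)
        x = [xi + step * pi for xi, pi in zip(x, p)]
        r = [ri - step * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x

def build_system(Q, c, LA, LB, alpha, beta, gamma):
    """Both sides of Equation 12: (QQ^T + alpha*LA + beta*LB + gamma*I) d = Qc."""
    n = len(Q)
    lhs = [[dot(Q[i], Q[j]) + alpha * LA[i][j] + beta * LB[i][j]
            + (gamma if i == j else 0.0) for j in range(n)] for i in range(n)]
    return lhs, mat_vec(Q, c)

# Hypothetical data: variable 2 is not covered by any trip.
Q = [[1.0, 0.0], [1.0, 1.0], [0.0, 0.0]]   # rows: cost variables, columns: trips
c = [10.0, 4.0]                             # observed trip costs
LA = [[1.0, 0.0, -1.0], [0.0, 0.0, 0.0], [-1.0, 0.0, 1.0]]
LB = [[0.0, 0.0, 0.0], [0.0, 0.5, -0.5], [0.0, -0.5, 0.5]]
lhs, rhs = build_system(Q, c, LA, LB, alpha=1.0, beta=1.0, gamma=0.01)
d_hat = conjugate_gradient(lhs, rhs)
```

In this toy instance, the third cost variable receives a positive value purely through the two Laplacian terms, which is the propagation effect the constraints are designed to produce. A positive $\gamma$ keeps the system matrix positive definite, which conjugate gradient requires.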

Finally, feeding the optimized cost variable vector $\hat{d}$ to function G.H, the time varying weights of the graph become available.

4.5 Discussion

In addition to the topology of a road network, other aspects of edges may be useful for identifying similarities among edges, e.g., the shapes and capacities of edges and the points of interest along edges [25]. Such information is not always available in digital maps and can be difficult to obtain. However, it is of interest to extend the proposed methods to take additional information, when available, into account. To achieve general applicability of the paper’s methods, we minimize the requirements of the input graph $G''$: both PRTC and DATC rely solely on the topology of a road network, which can be obtained easily from any digital map.





The weight annotation problem is finally handled by solving a system of linear equations, i.e., Equation 12. Alternative edge similarity metrics (e.g., considering the shapes and capacities of edges) can be easily incorporated into the linear system by adding new terms of the form $\delta \cdot L_M$, where $\delta$ is the hyper-parameter and $L_M$ is the Laplacian matrix derived from an alternative similarity metric. An alternative similarity metric sim should satisfy symmetry: $sim(e_i, e_j) = sim(e_j, e_i)$. Both PRTC and DATC satisfy symmetry.

The core operations in solving a system of linear equations using a conjugate gradient algorithm are matrix multiplication and transposition. This means that existing scalable matrix computation algorithms [26], [27] can be applied directly to make the proposed framework scalable and applicable to large road networks.

5 EXPERIMENTAL STUDY

We study the effectiveness of the proposed method for weight annotation of road networks with both travel time (TTWA) and GHG emissions (GEWA).

5.1 Experimental Setup

Road Networks: We use two road networks. The SK network is from Skagen, Denmark and has a primal graph with 543 vertices and 1,244 edges. The NJ network contains almost all of North Jutland, Denmark and has a primal graph with 17,956 vertices and 39,372 edges.

Trips: We use GPS observations collected from 28 vehicles in the period 2007-10-01 to 2007-10-15. When the vehicles were moving, positions were sampled at 1 Hz. The data was collected as part of an experiment where young drivers start out with a substantial rebate on their car insurance and then are warned if they exceed the speed limit and are penalized financially if they continue to speed.

We apply an existing tool for map matching GPS observations onto road segments, thus obtaining 431 trips in the SK network and 11,516 trips in the NJ network.

For TTWA, we use the total travel time for each trip, which can be obtained directly from the GPS observations of the trip, as the cost.

For GEWA, we use the GHG emissions of each trip as trip cost. Ideally, the exact fuel consumption should be obtained from CAN bus sensor data. Since such data is hard to obtain in a scalable fashion, we instead use the VT-micro model [15], which is able to compute the GHG emissions of trips in a robust fashion [3] based on the instantaneous velocities and accelerations derived from the GPS records of the trips. The 1 Hz GPS sampling frequency makes the VT-micro model easy to use.

Traffic Category Tags: In transportation research, PEAK and OFFPEAK periods are used widely to distinguish different traffic flows over the course of a day [28]. Thus, we use PEAK and OFFPEAK as traffic category tags. Further, we distinguish weekdays from weekend days, as traffic differs between weekdays and weekend days. To appropriately assign PEAK and OFFPEAK tags to the data set, we plot the numbers of GPS records according to their corresponding observed time at a one-hour granularity for weekdays and weekend days, respectively. Based on the generated histograms, we identify PEAK and OFFPEAK periods for weekdays. We find no clear peak periods during weekends and thus use WEEKENDS as the single tag for weekends. Table 3 provides the mapping (i.e., the function G.F) from time periods to tags.

Periods                     Tags
Weekdays [0:00, 7:00)       OFFPEAK
Weekdays [7:00, 8:00)       PEAK
Weekdays [8:00, 15:00)      OFFPEAK
Weekdays [15:00, 17:00)     PEAK
Weekdays [17:00, 24:00)     OFFPEAK
Weekends [0:00, 24:00)      WEEKENDS

TABLE 3
Traffic Category Tag Function G.F
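The mapping in Table 3 is simple enough to express as a small function. The sketch below is one possible encoding; the function name `traffic_tag` is hypothetical, and treating Saturday and Sunday as the weekend is an assumption, since the table only says "Weekends".

```python
from datetime import datetime

def traffic_tag(ts: datetime) -> str:
    """Return the traffic category tag of Table 3 for a timestamp."""
    if ts.weekday() >= 5:                       # Saturday (5) or Sunday (6)
        return "WEEKENDS"
    if 7 <= ts.hour < 8 or 15 <= ts.hour < 17:  # weekday peak periods
        return "PEAK"
    return "OFFPEAK"

# 2007-10-01 was a Monday; 2007-10-06 was a Saturday.
monday_morning = traffic_tag(datetime(2007, 10, 1, 7, 30))
saturday_morning = traffic_tag(datetime(2007, 10, 6, 7, 30))
```

Partitioning the training set by tag then amounts to grouping trips by the value this function returns for their traversal times.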

T-Drive [10] is able to assign distinct and fine-grained traffic tags to individual edges. The precondition of the method is that sufficient GPS data is associated with edges. However, a substantial fraction of all edges have no GPS data in our setting. Thus, we use traffic tags at the coarse granularity shown in Table 3.

Implementation Details: The PageRank computation is implemented in C using the iGraph library version 0.5.4 [29]. All remaining experiments are implemented in Java, where the conjugate gradient algorithm for solving a linear system is implemented using the MTJ (matrix-toolkits-java) package [30].

We use the threshold 0.95 to filter the entries in the PageRank-based similarity matrix A (Equation 7): if the value of an entry in A is smaller than 0.95, the entry is set to 0. We use the speed limits associated with roads to classify the edges into two categories, highways (with speed limits above 90 km/h) and urban roads (with speed limits below 90 km/h). We only apply the adjacency constraint to pairs of edges in the same category.

Due to the space limitation, the experiments only report the results using the best set of hyper-parameters, which are obtained by manual tuning on a separate data set using cross validation. This is a well-known method [18] for choosing hyper-parameters.

5.2 Experimental Results

5.2.1 Effectiveness Measurements

To gain insight into the accuracy of the obtained trip cost based weights, we split the set of (trip, cost) pairs into a training set $TC_{train}$ and a testing set $TC_{test}$. We use the training set to annotate the spatial network with weights, and we use the testing set to evaluate the accuracy of the weights. In the following experiments, we randomly choose 50% of the pairs for training and the remaining 50% for testing, unless explicitly stated otherwise.

Since no ground-truth time-dependent weights exist for the two road networks, the accuracy of the obtained weights can only be evaluated using the trips in testing set $TC_{test}$. If the obtained weights (using $TC_{train}$) actually reflect the travel costs, the difference between the actual cost and the estimated cost using the obtained weights (i.e., by using Equation 1 defined in Section 3.2) for each trip in the testing set $TC_{test}$ should be small.

We use the sum of squared loss (SSL) value (defined in Equation 13) between the actual cost $c^{(i)}$ and the estimated cost $cost(t^{(i)})$ over every trip in the testing set $TC_{test}$ to measure the accuracy of the obtained weights.

$$SSL(TC_{test}) = \sum_{(t^{(i)}, c^{(i)}) \in TC_{test}} \left(c^{(i)} - cost(t^{(i)})\right)^2 \tag{13}$$

For example, if the GHG emissions based weights really reflect the actual GHG emissions, the sum of squared loss between the actual GHG emissions and the estimated GHG emissions over every testing trip should tend to be small. The smaller the sum of squared loss, the more accurate the weights.
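The SSL measure of Equation 13 can be sketched as below. The trip representation (a list of edge identifiers) and the estimator that sums per-edge weights are assumptions made for illustration; the paper's own cost function is Equation 1 in Section 3.2, and the numbers are hypothetical.

```python
def ssl(test_pairs, estimate):
    """Sum of squared loss (Equation 13) over (trip, actual_cost) pairs."""
    return sum((actual - estimate(trip)) ** 2 for trip, actual in test_pairs)

# Hypothetical annotated weights and test trips.
weights = {"e1": 10.0, "e2": 5.0}

def estimate(trip):
    # Assumed estimator: the trip cost is the sum of the weights of its edges.
    return sum(weights[edge] for edge in trip)

test_pairs = [(["e1", "e2"], 16.0), (["e2"], 4.0)]
loss = ssl(test_pairs, estimate)   # (16 - 15)^2 + (4 - 5)^2
```

Dividing the SSL of one objective function by the SSL of another, computed on the same test pairs, gives a relative effectiveness ratio.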

To gain insight into the effectiveness of the proposed objective functions, we compare four combinations of the functions:

1) $F1 = RSS(d) + \gamma \cdot \|d\|_2^2$.
2) $F2 = RSS(d) + \alpha \cdot PRTC(d) + \gamma \cdot \|d\|_2^2$.
3) $F3 = RSS(d) + \beta \cdot DATC(d) + \gamma \cdot \|d\|_2^2$.
4) $F4 = RSS(d) + \alpha \cdot PRTC(d) + \beta \cdot DATC(d) + \gamma \cdot \|d\|_2^2$.

Function F1 only considers the residual sum of squares. Functions F2 and F3 take into account the PageRank-based topological constraint and the directional adjacency constraint, respectively. Function F4 takes into account both constraints.

As the objective function used in trajectory regression [7] also considers adjacency, we can view the method using function F3 as an improved version of trajectory regression because (i) function F3 works not only for travel times, but also for other travel costs, e.g., GHG emissions; (ii) function F3 considers the temporal variations of travel costs, while trajectory regression does not; and (iii) function F3 considers directional adjacency, while trajectory regression models a road network as an undirected graph and only considers undirected adjacency.

The sum of squared loss value for using objective function $F_i$ is denoted as $SSL_{F_i}(TC_{test})$. In order to show the relative effectiveness of the proposed objective functions, we report the ratios $Ratio_{F2} = \frac{SSL_{F2}(TC_{test})}{SSL_{F1}(TC_{test})}$, $Ratio_{F3} = \frac{SSL_{F3}(TC_{test})}{SSL_{F1}(TC_{test})}$, and $Ratio_{F4} = \frac{SSL_{F4}(TC_{test})}{SSL_{F1}(TC_{test})}$.

Coverage, defined in Equation 14, is introduced as another measurement.

$$Cove_{F_i}(TC_{train}) = \frac{|\{e \mid e \in G.E \wedge annotated(e)\}|}{|G.E|}, \tag{14}$$

where $annotated(e)$ holds if edge e is annotated with weights using $TC_{train}$. Function $Cove_{F_i}$ indicates the ratio of the number of edges whose weights have been annotated by using objective function $F_i$ to the total number of edges in the road network. The higher the coverage is, the more edges in the road network are annotated with weights, and thus the better the performance.
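Equation 14 reduces to a set computation. The sketch below uses a hypothetical five-edge network; which edges count as annotated depends on the objective function used and is simply given as input here.

```python
def coverage(all_edges, annotated_edges):
    """Fraction of the network's edges that carry an annotated weight (Equation 14)."""
    edges = set(all_edges)
    return len(edges & set(annotated_edges)) / len(edges)

# Hypothetical network with five edges, two of which received weights.
all_edges = ["e1", "e2", "e3", "e4", "e5"]
cove = coverage(all_edges, ["e1", "e4"])   # 2 of 5 edges
```

Intersecting with the network's edge set guards against counting annotated edges that are not part of the network.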

5.2.2 Travel Time Based Weight Annotation

Effectiveness of objective functions: Table 4 reports the results on travel time based weight annotation. Column $SSL_{F1}$ reports the absolute SSL values over all test trips when using objective function F1 for both data sets. NJ has much larger SSL values than SK because it has many more testing trips. For both road networks, the weights annotated using objective function F4 have the smallest SSL values.

      SSL_F1        Ratio_F2   Ratio_F3   Ratio_F4
SK    88,656        99.2%      44.0%      43.8%
NJ    14,823,752    92.2%      49.2%      43.1%

TABLE 4
Effectiveness on TTWA

We also observe that the PageRank based topological constraint works more effectively on NJ than on SK. The reason is that Skagen is a small town in which few road segments have similar topology (e.g., similar weighted PageRank values). In the NJ network, the PageRank based topological constraint gives a better accuracy improvement since more road segments have similar weighted PageRank values.

The coverage reported in Table 5 also supports the observation. When using objective function F1, only the edges in the set of training trips can be annotated, which can be expected to be a small portion of the road network. When using objective function F2, the coverage of the SK network increases much less than for the NJ network. This suggests that in a large road network, the PageRank based topological constraint substantially increases the coverage of the annotation, thus improving the overall annotation accuracy.

      Cove_F1   Cove_F2   Cove_F3   Cove_F4
SK    22.8%     28.8%     100%      100%
NJ    34.8%     86.7%     99.6%     100%

TABLE 5
Coverage of Weight Annotation





The directed adjacency topological constraint yields similar accuracy improvements on both road networks, and the accuracy improvement is more substantial than the improvement given by the PageRank based topological constraint. This is as expected because a road network is fully connected, and DATC is able to finally affect almost every edge, which gives more information for the edges that are not traversed by trips in the training set. This can be observed from the third column of Table 5.

For both road networks, PRTC and DATC together give the best accuracy, as shown in column $Ratio_{F4}$ in Table 4. This finding offers evidence of the overall effectiveness of the proposed objective functions.

Accuracy comparison with a baseline: The test trips contain edges that are not covered by any training trips. Therefore, existing methods [10] that can estimate travel time based on historical data are inapplicable as baselines.

If the speed limit of every edge in a road network is available, we can use speed limit derived weights as a baseline for travel time based weight annotation. While it is difficult to obtain a speed limit for every road segment in a road network, we can use default values where values are missing. In the NJ network, 62 edges lack a speed limit and are assigned a default value (50 km/h).

Given an edge e and its speed limit $sl(e)$ and length $G.L(e)$, the corresponding travel time based weight for e is $\lambda \cdot \frac{G.L(e)}{sl(e)}$ if e is an urban road (where $\lambda \ge 1$) and $\frac{G.L(e)}{sl(e)}$ if e is a highway.

The factor $\lambda$ is used because vehicles tend to travel at speeds below the speed limit on urban roads and at the speed limit on highways. Previous work [7] uses $\lambda = 2$, meaning that vehicles normally travel at half the speed limit in urban regions. However, we find that $\lambda = 1$ works the best for our data. The reason may be two-fold: (i) the data we use is collected from young drivers who tend to drive more aggressively than average drivers; (ii) the SK and NJ networks are relatively congestion-free when compared to Kyoto, Japan, which is simulated in previous work [7].
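The speed-limit baseline can be written as a short function. The units (meters, km/h, seconds) and the function name `baseline_travel_time` are assumptions of this sketch; whether an edge is urban is passed in explicitly rather than derived from the speed limit.

```python
def baseline_travel_time(length_m, speed_limit_kmh, is_urban, lam=1.0):
    """Speed-limit-derived travel time in seconds for one edge.

    Urban roads are scaled by lam (lam >= 1); highways use the speed limit as is.
    """
    seconds = length_m / (speed_limit_kmh / 3.6)   # km/h -> m/s
    return lam * seconds if is_urban else seconds

urban_half_speed = baseline_travel_time(100, 60, is_urban=True, lam=2.0)
urban_at_limit = baseline_travel_time(100, 60, is_urban=True, lam=1.0)
highway = baseline_travel_time(1000, 110, is_urban=False, lam=2.0)
```

With $\lambda = 2$, a 100 m urban edge with a 60 km/h limit takes about 12 s instead of about 6 s, while the highway value is unaffected by $\lambda$.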

The above allows us to treat the speed limit derived weights as a baseline method for travel time based weight annotation. To observe the accuracy of the baseline method, its accuracy is also evaluated using SSL over every testing trip. Specifically, the baseline with $\lambda = 2$ is denoted as $SSL_{BL,\lambda=2}(TC_{test})$, and the baseline with $\lambda = 1$ is denoted as $SSL_{BL,\lambda=1}(TC_{test})$. The two resulting baselines are compared with the proposed method, and the results are reported in Table 6, where $Ratio_{\lambda=2} = \frac{SSL_{F4}(TC_{test})}{SSL_{BL,\lambda=2}(TC_{test})}$ and $Ratio_{\lambda=1} = \frac{SSL_{F4}(TC_{test})}{SSL_{BL,\lambda=1}(TC_{test})}$. The ratios $Ratio_{\lambda=1}$ on the two road networks show that the weights obtained by our method are substantially better than the best case of the weights obtained from the speed limits.

      Ratio_λ=2   Ratio_λ=1
SK    36.0%       78.8%
NJ    24.2%       90.8%

TABLE 6
Comparison With Baselines on TTWA

The same deviation has quite a different meaning for long versus short trips. For example, a 50-second deviation can be considered as a very good estimation error for a 30-minute trip, while it is a poor estimation error for a 2-minute trip. Thus, to better understand how the overall SSL values are distributed, we plot the number of test trips whose absolute loss ratio (ALR) values are within x percent in Fig. 7. Given a test pair $(t^{(i)}, c^{(i)}) \in TC_{test}$, its ALR value equals the absolute difference between the estimated and actual costs divided by the actual cost, as defined in Equation 15.

$$ALR((t^{(i)}, c^{(i)})) = \frac{|cost(t^{(i)}) - c^{(i)}|}{c^{(i)}} \tag{15}$$
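The ALR of Equation 15 and the share of trips below a given ALR threshold can be sketched as follows. The two sample trips reuse the 50-second deviation example from the text; the helper `share_below` is a hypothetical name for the quantity plotted in the ALR figures.

```python
def alr(estimated, actual):
    """Absolute loss ratio (Equation 15)."""
    return abs(estimated - actual) / actual

def share_below(pairs, threshold):
    """Fraction of (estimated, actual) pairs whose ALR is below threshold."""
    return sum(1 for est, act in pairs if alr(est, act) < threshold) / len(pairs)

# A 50-second deviation on a 30-minute trip and on a 2-minute trip (in seconds).
long_trip = alr(1850.0, 1800.0)    # about 2.8%
short_trip = alr(170.0, 120.0)     # about 41.7%
good_share = share_below([(1850.0, 1800.0), (170.0, 120.0)], 0.30)
```

With a 30% threshold, only the long trip counts as a good estimation, even though both trips have the same absolute deviation.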

Our method shows the best result as the majority of the test trips have smaller ALR values. Assume that we consider an ALR below 30% as a good estimation. Fig. 7 shows that 84.3% of test trips have good estimations using the proposed method. In contrast, only 67.4% and 22.1% of test trips have good estimations using baseline methods with $\lambda = 1$ and $\lambda = 2$, respectively.

[Figure: two panels plotting "Percentage of Trips (%)" against "ALR is less than x%" for Baseline and Objective Function F4: (a) Baseline with λ = 2; (b) Baseline with λ = 1]

Fig. 7. ALR Comparison on TTWA of NJ

We do not integrate speed limits into our method because (i) for edges without available speed limits, the obtained weights are quite sensitive to the assigned default speed limits: inaccurate defaults deteriorate the performance severely; and (ii) speed limits do not give obvious benefits when annotating edges with GHG emissions based weights, as we will see shortly in Section 5.2.3 (in particular, in Fig. 8).

5.2.3 GHG Emissions Based Weight Annotation

Effectiveness of objective functions: Table 7 reports the results on GHG emissions based weight annotation. In general, the results are consistent with the results from the travel time based weight annotation (as shown in Table 4): (i) the PageRank-based topological constraint works more effectively on the NJ network than on the SK network; (ii) the directed adjacency constraint works more effectively than the PageRank-based topological constraint; (iii) the weights obtained by using both PRTC and DATC give the best accuracy. The coverage when using the different objective functions is exactly the same as what was reported in Table 5.

      SSL_F1        Ratio_F2   Ratio_F3   Ratio_F4
SK    175.931       99.9%      40.3%      30.0%
NJ    87,362,465    94.5%      66.2%      44.3%

TABLE 7
Effectiveness on GEWA

Comparison with a baseline: As we did for travel times, we use speed limits to devise a baseline for GHG emissions based weight annotation. Assuming a vehicle travels on an edge at constant speed (e.g., the speed limit of the edge), we can simulate a sequence of instantaneous velocities. For example, let an edge be 100 meters long and the speed limit be 60 km/h. The simulated trip on the road segment is represented by a sequence of 6 records, each with 60 km/h as the instantaneous velocity. This allows us to apply the VT-micro model to estimate GHG emissions based edge weights. Since the previous set of experiments has already shown that the speed limit (i.e., $\lambda = 1$) is the best fit for our data, we simply use the speed limit here.
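The simulated velocity sequence can be sketched as below. At 60 km/h (about 16.7 m/s), a 100 m edge takes 6 s, which yields 6 records at 1 Hz. Rounding the record count to the nearest integer is an assumption of this sketch, since the text does not say how fractional seconds are handled, and the VT-micro model itself is not reproduced here.

```python
def constant_speed_records(length_m, speed_kmh, sampling_hz=1.0):
    """Simulated instantaneous velocities (km/h) for an edge driven at constant speed."""
    travel_seconds = length_m / (speed_kmh / 3.6)     # km/h -> m/s
    # Assumed rounding rule; at least one record per edge.
    n_records = max(1, round(travel_seconds * sampling_hz))
    return [speed_kmh] * n_records

records = constant_speed_records(100, 60)   # six records of 60 km/h
```

Because the speed is constant, every simulated acceleration is zero, so such a baseline cannot reflect the emissions caused by stop-and-go driving.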

We obtain $Ratio_{\lambda=1} = 24.7\%$ for SK and $Ratio_{\lambda=1} = 29.8\%$ on NJ. Fig. 8 shows the percentage of test trips whose ALR values are less than x% using the baseline with $\lambda = 1$ and the proposed method, respectively. These results clearly show the better performance of the proposed method, as the majority of test trips have smaller ALR values.

[Figure: "Percentage of Trips (%)" plotted against "ALR is less than x%"]

Fig. 8. ALR Comparison on GEWA of NJ

5.2.4 Effectiveness of the Size of Training Trips

In this section, we study the accuracy when varying the training set size. Specifically, on the NJ network, we reserve 20% of the (trip, cost) pairs as the testing set, denoted as $TC_{test}$, and the remaining 80% as the training set, denoted as $TC_{train}$. In order to observe the accuracy of weight annotation on different sizes of $TC_{train}$, we use 100%, 80%, 60%, 40% and 20% of $TC_{train}$ to annotate the weights, respectively. The results are shown in Fig. 9.

[Figure: x-axis "Percentage of TCtrain (%)" (20 to 100); y-axis "Ratio_{λ=1} (%)" (0 to 140); series: Travel Time, GHG Emissions]

Fig. 9. Results on Different Size of TCtrain

For travel time, when only 20% of TCtrain is used, the accuracy of our method is worse than the baseline method with λ = 1 because the baseline has a rough estimation for the costs of all edges, while the 20% of TCtrain covers only 16.3% of the edges in the road network. Although our method propagates weights to edges that are not covered by the training trips, the accuracy suffers when the initial coverage of the training trips is low. When 40% of TCtrain is used, the accuracy of our method is much better than that of the baseline. In this case, the training trips cover 23.3% of all edges. As the training set size increases, the accuracy of the travel time weights also increases. When we use all trips in TCtrain, the accuracy of our method is almost twice that of the baseline.
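The coverage figures quoted above (16.3% and 23.3% of edges) are the fraction of network edges traversed by at least one training trip. A minimal sketch of that computation follows, assuming each trip is given as a sequence of edge identifiers; the function name `edge_coverage` and the toy data are hypothetical.

```python
def edge_coverage(trips, num_edges):
    """Fraction of the network's edges traversed by at least one trip.

    Assumption: each trip is an iterable of edge identifiers, and
    `num_edges` is the total number of edges in the road network.
    """
    covered = set()
    for trip in trips:
        covered.update(trip)  # an edge counts once however often it is used
    return len(covered) / num_edges


# Toy example: two trips over a 10-edge network share edge "e2".
toy_trips = [["e1", "e2"], ["e2", "e3"]]
print(edge_coverage(toy_trips, num_edges=10))  # 3 distinct edges out of 10
```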

For GHG emissions, we observe a similar trend: with more training trips, the accuracy of the corresponding weights improves, and our method always outperforms the baseline when annotating edges with GHG emissions based weights.

This experiment shows that (i) our method works effectively even when the coverage of the trips in the training set is low, and (ii) if the coverage of the trips in the training set increases, e.g., by providing more (trip, cost) pairs as training set, the accuracy of the obtained weights also increases.

6 CONCLUSION AND OUTLOOK

Reduction in GHG emissions from transportation calls for effective eco-routing, and road network graphs where all edges are annotated with accurate weights that capture environmental costs, e.g., fuel usage or GHG emissions, are needed for eco-routing. However, such weights are not always readily available for a road network. This paper proposes a general framework that takes as input a collection of (trip, cost) pairs and assigns trip cost based weights to a graph representing a road network, where trip cost based weights may reflect GHG emissions, fuel consumption, or travel time. By using the framework, edge weights capturing environmental impact can be computed for the whole road network, thus enabling eco-routing. To the best of our knowledge, this is the first work that provides a general framework for assigning trip cost based edge weights based on a set of (trip, cost) pairs.

Two directions for future work are of particular interest. It is of interest to explore whether accuracy improvement is possible by using distinct PEAK and OFFPEAK tags for different road segments. Likewise, it is of interest to explore means of updating weights in real time. A module that takes as input real time streaming data, e.g., real time GPS observations along with costs, can be incorporated into the framework.

ACKNOWLEDGMENTS

This work was supported by the Reduction project that is funded by the European Commission as FP7-ICT-2011-7 STREP project number 288254.

REFERENCES[1] What is the EU doing on climate change? http://ec.europa.

eu/clima/policies/brief/eu/index en.htm.[2] Reducing emissions from transport. http://ec.europa.eu/

clima/policies/transport/index en.htm.[3] C. Guo, Y. Ma, B. Yang, C.S. Jensen, and M. Kaul. EcoMark:

Evaluating models of vehicular environmental impact. In GIS,pages 269–278, 2012.

[4] T. Kono, T. Fushiki, K. Asada, and K. Nakano. Fuel consump-tion analysis and prediction model for eco route search. In 15thWorld Congress on Intelligent Transport Systems and ITS America’s2008 Annual Meeting, 2008.

[5] E. Ericsson, H. Larsson, and K. Brundell-Freij. Optimizingroute choice for lowest fuel consumption-potential effects ofa new driver support tool. Transportation Research Part C:Emerging Technologies, 14(6):369–383, 2006.

[6] G. Tavares, Z. Zsigraiova, V. Semiao, and M.G. Carvalho.Optimisation of MSW collection routes for minimum fuelconsumption using 3D GIS modelling. Waste Management,29(3):1176–1185, 2009.

[7] T. Ide and M. Sugiyama. Trajectory regression on roadnetworks. In AAAI, pages 203–208, 2011.

[8] T. Ide and S. Kato. Travel-time prediction using gaussianprocess regression: A trajectory-based approach. In SDM,pages 1183–1194, 2009.

[9] S. Clark. Traffic prediction using multivariate nonparametricregression. Journal of Transportation Engineering, 129(2):161–168,2003.

[10] J. Yuan, Y. Zheng, C. Zhang, W. Xie, X. Xie, G. Sun, andY. Huang. T-drive: driving directions based on taxi trajectories.In GIS, pages 99–108, 2010.

[11] J. Ygnace, C. Drane, Y. B. Yim, and L. Renaud. Travel timeestimation on the san francisco bay area network using cellularphones as probes. Technical report, Institute of TransportationStudies, UC Berkeley, 2000.

[12] J. C. Herrera and A. M. Bayen. Incorporation of lagrangianmeasurements in freeway traffic state estimation. Transporta-tion Research Part B: Methodological, 44(4):460–481, 2010.

[13] J. Yuan, Y. Zheng, X. Xie, and G. Sun. Driving with knowledgefrom the physical world. In KDD, pages 316–324, 2011.

[14] G. Song, L. Yu, Z. Wang, et al. Aggregate fuel consumptionmodel of light-duty vehicles for evaluating effectiveness oftraffic management strategies on fuels. Journal of TransportationEngineering, 135:611, 2009.

[15] K. Ahn, H. Rakha, A. Trani, and M. Van Aerde. Estimatingvehicle fuel consumption and emissions based on instanta-neous speed and acceleration levels. Journal of TransportationEngineering, 128(2):182–190, 2002.

[16] E. Kohler, K. Langkau, and M. Skutella. Time-expandedgraphs for flow-dependent transit times. In ESA, pages 599–611, 2002.

[17] B. George and S. Shekhar. Time-aggregated graphs for model-ing spatio-temporal networks. In ER (Workshops), pages 85–99,2006.

[18] C.M. Bishop. Pattern recognition and machine learning. SpringerNew York, 2006.

[19] B. Jiang. Ranking spaces for predicting human movementin an urban environment. International Journal of GeographicalInformation Science, 23(7):823–837, 2009.

[20] E. Crisostomi and et al. A google-like model of road networkdynamics and its application to regulation and control. Inter-national Journal of Control, 84(3):633–651, 2011.

[21] A. N. Langville and C. D. Meyer. Survey: Deeper insidepagerank. Internet Mathematics, 1(3):335–380, 2003.

[22] C. F. Daganzo and Y. Sheffi. On stochastic models of trafficassignment. Transportation Science, 11(3):253–274, 1977.

[23] B. Jiang, S. Zhao, and J. Yin. Self-organized natural roads forpredicting traffic flow: a sensitivity study. Journal of StatisticalMechanics: Theory and Experiment, 2008:7008–7035, 2008.

[24] G.H. Golub and C.F. Van Loan. Matrix Computations. JohnsHopkins University Press, 1996.

[25] N.J. Yuan, Y. Zheng, L. Zhang, and X. Xie. T-finder: Arecommender system for finding passengers and vacant taxis.IEEE Transactions on Knowledge and Data Engineering, 2012.

[26] S. Seo, E.J. Yoon, J. Kim, S. Jin, J.S. Kim, and S. Maeng.Hama: An efficient matrix computation with the mapreduceframework. In CloudCom, pages 721–726, 2010.

[27] J. Lin and C. Dyer. Data-intensive text processing withmapreduce. Synthesis Lectures on Human Language Technologies,3(1):1–177, 2010.

[28] P. Cantos-Sanchez, R. Moner-Colonques, J.J. Sempere-Monerris, and A. Alvarez-SanJaime. Viability of new roadinfrastructure with heterogeneous users. TransportationResearch, Part A, 45(5):435–450, 2011.

[29] igraph library. http://igraph.sourceforge.net/.[30] matrix-toolkits-java package. http://code.google.com/p/

matrix-toolkits-java.

Bin Yang received his B.E. and M.E. degrees from Northwestern Polytechnical University, China, in 2004 and 2007, respectively, and the Ph.D. degree in computer science from Fudan University, China in 2010. He worked as a research assistant in Aalborg University, Denmark, during 2008–2009. He spent more than one year at Max-Planck-Institut für Informatik, Germany, as a postdoctoral researcher during 2010–2011. In September 2011, he joined Aarhus University, Denmark as a postdoc at the level of research assistant professor. His research interests include data management and data analytics. He has served on program committees of several database conferences and as invited reviewer for several database journals, including ICDE, TKDE, and The VLDB Journal.

Manohar Kaul received his B.Engg (Honors) degree from the Department of Computer Science and Electronic Engineering, Latrobe University, Australia in 2000. From 2000–2009, he worked in industry, primarily at ORACLE for 5 years as a Senior Systems/Database Architect, specializing in handling very large datasets, especially in the Utilities, Banking and Telecommunication sectors. In late 2009, he joined the Computer Science M.Sc. programme at the Computer Science Department in Uppsala University, Sweden and graduated in 2011. Currently, he is a Ph.D. student in the Data Intensive Systems Group at Aarhus University, Denmark under the supervision of Prof. Christian S. Jensen. His research interests cover spatio-temporal databases, indexing, and graph theory.

Christian S. Jensen is a Professor of Computer Science at Aarhus University, Denmark, and he was previously at Aalborg University for two decades. He recently spent a 1-year sabbatical at Google Inc., Mountain View. His research concerns data management and data-intensive systems, and its focus is on temporal and spatio-temporal data management. Christian is an ACM and an IEEE fellow, and he is a member of the Royal Danish Academy of Sciences and Letters and the Danish Academy of Technical Sciences. He has received several national and international awards for his research. He is currently vice-chair of ACM SIGMOD and an editor-in-chief of The VLDB Journal.

