Rafael J. Fernández-Moctezuma April 3, 2007


Page 1: Rafael J. Fernández-Moctezuma April 3, 2007

1

Have we met?

Trying to find similar events in an archival database in the context of travel time estimation

Rafael J. Fernández-Moctezuma

April 3, 2007

Page 2: Rafael J. Fernández-Moctezuma April 3, 2007

2

Acknowledgements

• Kristin Tufte for introducing me to the “fetch similar” problem – and helping break the problem into smaller pieces to solve.

• Umut Ozertem (OGI) for valuable discussion from a Machine Learning perspective.

Page 3: Rafael J. Fernández-Moctezuma April 3, 2007

3

Contents

• Issues in estimating travel time

– “Instantaneous estimate”

– A posteriori estimate

• Combining archived data efficiently

• Imposing structure to minimize the search space

• Preliminary results

• Concurrent / Future work

Page 4: Rafael J. Fernández-Moctezuma April 3, 2007

4

Motivation: Travel time

Direction of flow

Page 5: Rafael J. Fernández-Moctezuma April 3, 2007

5

Travel time

[Figure: stations A, B, C along the direction of flow, labeled MA, MB, MC.]

Inductive loop detectors measure speed, occupancy, and volume. These are the three fundamental quantities used to reason theoretically about traffic flow.

Page 6: Rafael J. Fernández-Moctezuma April 3, 2007

6

Travel time

[Figure: stations A, B, C (labeled MA, MB, MC) along the direction of flow, with a region of influence marked.]

Page 7: Rafael J. Fernández-Moctezuma April 3, 2007

7

Travel time

[Figure: stations A, B, C (labeled MA, MB, MC) along the direction of flow, with a region of influence, a VMS, and the endpoints α and Ω marked.]

A Variable Message Sign (VMS) could inform drivers of estimated travel times. This is useful before intersections: drivers may select alternate routes if there is considerable delay ahead.

Page 8: Rafael J. Fernández-Moctezuma April 3, 2007

8

Travel time

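[Figure: stations A, B, C (labeled MA, MB, MC) along the direction of flow, with a region of influence, a VMS, the endpoints α and Ω, and the boundaries AB and BC marked.]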

Page 9: Rafael J. Fernández-Moctezuma April 3, 2007

9

Instantaneous estimate

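[Figure: stations A, B, C (labeled MA, MB, MC) along the direction of flow, with a VMS, the endpoints α and Ω, and the boundaries AB and BC marked.]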

At time t0, calculate the travel time from α to Ω as the sum of times between the regions (i.e., time from α to AB + time from AB to BC + time from BC to Ω).

However, by the time the vehicle arrives at AB, conditions measured at B may have changed.

[Timeline: t0 to tf]
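A minimal sketch of the instantaneous estimate, assuming each sub-segment has a known length and a speed reading taken at t0. The function name, the way the 3.7-mile segment is split, and the speeds are hypothetical values for illustration only.

```python
def instantaneous_travel_time(lengths_miles, speeds_mph):
    """Sum per-sub-segment times using only the speeds measured at t0.

    lengths_miles: lengths of alpha->AB, AB->BC, BC->Omega (hypothetical).
    speeds_mph: speed reported at t0 for each sub-segment.
    Returns the estimated travel time in minutes.
    """
    if len(lengths_miles) != len(speeds_mph):
        raise ValueError("need one speed per sub-segment")
    return sum(60.0 * length / speed
               for length, speed in zip(lengths_miles, speeds_mph))


# Hypothetical readings at t0 for three sub-segments.
lengths = [1.0, 1.5, 1.2]
speeds_at_t0 = [40.0, 30.0, 20.0]
estimate_minutes = instantaneous_travel_time(lengths, speeds_at_t0)
print(f"instantaneous estimate: {estimate_minutes:.1f} min")
```

Every term uses conditions frozen at t0, which is exactly why the estimate can drift from what a vehicle later experiences.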

Page 10: Rafael J. Fernández-Moctezuma April 3, 2007

10

A posteriori estimate

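[Figure: stations A, B, C (labeled MA, MB, MC) along the direction of flow, with a VMS, the endpoints α and Ω, and the boundaries AB and BC marked.]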

At time t0, calculate the travel time from α to AB, which gives an arrival time at AB (say, t1). At time t1, look at the conditions reported by B, and calculate the time between AB and BC (and so on).

This is closer to the travel time a vehicle experienced, but this estimate cannot be computed online at α for the complete segment, since we cannot see the future.

[Timeline: t0, t1, t2, tf]
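A sketch of the a posteriori walk, assuming a per-minute archive of speeds for each sub-segment. The archive layout, the names, and the numbers are hypothetical. As a simplification, the speed read on entering a sub-segment is used for the whole sub-segment.

```python
def a_posteriori_travel_time(lengths_miles, speed_history_mph, t0_min):
    """Walk the segment, reading each sub-segment's speed at the time the
    vehicle would actually enter it, rather than the speed seen at t0.

    speed_history_mph[k][t] is the archived speed of sub-segment k at
    minute t (hypothetical archive layout).
    Returns the travel time in minutes.
    """
    clock = float(t0_min)
    for length, history in zip(lengths_miles, speed_history_mph):
        # Reading in effect at entry; clamp if we run past the archive.
        minute = min(int(clock), len(history) - 1)
        clock += 60.0 * length / history[minute]
    return clock - t0_min


lengths = [1.0, 1.5, 1.2]
history = [
    [40.0] * 10,              # alpha -> AB: steady
    [30.0] + [45.0] * 9,      # AB -> BC: clears up after minute 0
    [20.0] * 3 + [36.0] * 7,  # BC -> Omega: clears up after minute 2
]
experienced_minutes = a_posteriori_travel_time(lengths, history, t0_min=0)
print(f"a posteriori estimate: {experienced_minutes:.1f} min")
```

With these made-up numbers the speeds at t0 are 40, 30, and 20 mph, so an instantaneous estimate would be 8.1 minutes, while the walk through the archive gives 5.5 minutes because conditions downstream improved before the vehicle got there.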

Page 11: Rafael J. Fernández-Moctezuma April 3, 2007

11

Archived data is useful

• It is possible to compute a posteriori estimates for previously observed measurements.

• This opens the possibility of incorporating previously seen travel times (associated with instantaneous measurements) into online estimation.

Page 12: Rafael J. Fernández-Moctezuma April 3, 2007

12

Identical vs. Similar

• We cannot guarantee that all possible combinations of measured values have been observed already.

• We would also like the recall of relevant data points to be a fast process – we don’t want to go through the entire history at every refresh.

• ML people constantly complain about “not having enough data”. In this case, we have a lot of data and we wish to quickly extract a representative sample.

Page 13: Rafael J. Fernández-Moctezuma April 3, 2007

13

Let’s step out and think in general terms: A model system

[Diagram: data stream of measurements, archive, selection mechanism, similar historical measurements, fusion mechanism, estimation.]

Page 14: Rafael J. Fernández-Moctezuma April 3, 2007

14

A model system

[Diagram: data stream of measurements, archive, selection mechanism, similar historical measurements, fusion mechanism, estimation.]

How can we do this efficiently?

Page 15: Rafael J. Fernández-Moctezuma April 3, 2007

15

A model system

[Diagram: data stream of measurements, archive, selection mechanism, similar historical measurements, fusion mechanism, estimation.]

What is a reasonable strategy?

Page 16: Rafael J. Fernández-Moctezuma April 3, 2007

16

A model system

[Diagram: data stream of measurements, archive, selection mechanism, similar historical measurements, fusion mechanism, estimation.]

Our effort so far has concentrated on this section.

Page 17: Rafael J. Fernández-Moctezuma April 3, 2007

17

Impose structure in the data archive

• Databases are very efficient when we know what to ask (e.g., “value >= 20” benefits greatly from an index lookup, if an index exists)

• Can we index “similarity”?

• Consider imposing structure on previously seen information. We can be clever about what to index and reduce the search space on the fly.

Page 18: Rafael J. Fernández-Moctezuma April 3, 2007

18

Similar as “close” in a vector space

Retrieve the 3 nearest points to the new point

For n existing points in the space, we need n comparisons.

This problem is referred to as k-nearest neighbors.

We can define “Close” in terms of Euclidean distance.
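A small sketch of the brute-force search: one Euclidean distance per archived point, then keep the k closest. The function name and the sample points are made up for illustration.

```python
import math


def k_nearest(points, query, k=3):
    """Brute-force k-nearest neighbors: n distance computations for n points."""
    by_distance = sorted(points, key=lambda p: math.dist(p, query))
    return by_distance[:k]


# Hypothetical 2-D points standing in for archived measurements.
archive = [(0, 0), (1, 1), (5, 5), (2, 2), (9, 9)]
neighbors = k_nearest(archive, query=(1.2, 1.2), k=3)
print(neighbors)
```

The cost grows linearly with the size of the archive, which is the motivation for pruning before searching.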

Page 19: Rafael J. Fernández-Moctezuma April 3, 2007

19

What if we can prune off some points?

Suppose we are given regional boundaries. We can then first determine which region is closest to the new point, and then perform the search only within it.

The outcome of clustering can give us such boundaries.

Page 20: Rafael J. Fernández-Moctezuma April 3, 2007

20

K-means

One of the simplest algorithms for clustering. It attempts to minimize the variance among members of each cluster, which is equivalent to maximizing the variance between clusters.

It finds the centroids of k clusters. The centroids need not be observed points.

For this example, an initial comparison with three centroids reduces the search space to roughly 1/3 of the points.
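A sketch of the idea: cluster once, then compare a new point against the centroids and search only the closest cluster. This is a plain Lloyd-style k-means written for illustration, with the starting centroids passed in explicitly so the example is deterministic. The function names and the toy data are assumptions, not the prototype's code.

```python
import math


def kmeans(points, initial_centroids, iterations=20):
    """Plain k-means: alternate assignment and centroid update."""
    centroids = list(initial_centroids)
    k = len(centroids)

    def nearest(p):
        return min(range(k), key=lambda j: math.dist(p, centroids[j]))

    for _ in range(iterations):
        labels = [nearest(p) for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, [nearest(p) for p in points]


def search_nearest_cluster(points, labels, centroids, query, n_neighbors=3):
    """Pick the closest centroid, then do the neighbor search in that cluster only."""
    best = min(range(len(centroids)), key=lambda j: math.dist(query, centroids[j]))
    candidates = [p for p, lab in zip(points, labels) if lab == best]
    return sorted(candidates, key=lambda p: math.dist(p, query))[:n_neighbors]


# Three hypothetical, well-separated groups of four points each.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (20, 0), (20, 1), (21, 0), (21, 1)]
centroids, labels = kmeans(points, [(0, 0), (10, 10), (20, 0)])
neighbors = search_nearest_cluster(points, labels, centroids, query=(9, 9))
print(centroids)
print(neighbors)
```

Here the search touches 3 centroids plus 4 cluster members instead of all 12 points.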

Page 21: Rafael J. Fernández-Moctezuma April 3, 2007

21

K-means

The random start property of k-means implies several limitations, three of which stand out:

(1) For a particular choice of k, there can be more than one solution.

(2) It is possible to end up with empty clusters.

(3) The initial choice of centroids can be problematic (bad derivative).

We can cope with these limitations by doing several Monte Carlo runs.
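One way to read "several Monte Carlo runs" is to repeat k-means from different random starts and keep the run with the lowest within-cluster sum of squared distances. The sketch below does that. The selection criterion, the seeds, and the names are assumptions for illustration.

```python
import math
import random


def kmeans_random_start(points, k, seed, iterations=20):
    """One k-means run from a random start; returns (centroids, sse)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random start: k observed points
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[j].append(p)
        for j, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    # Within-cluster sum of squared distances to the nearest centroid.
    sse = sum(min(math.dist(p, c) for c in centroids) ** 2 for p in points)
    return centroids, sse


def best_of_restarts(points, k, restarts=25):
    """Repeat from seeds 0..restarts-1 and keep the lowest-SSE run."""
    runs = [kmeans_random_start(points, k, seed) for seed in range(restarts)]
    return min(runs, key=lambda run: run[1])


points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (20, 0), (20, 1), (21, 0), (21, 1)]
best_centroids, best_sse = best_of_restarts(points, k=3)
print(best_centroids, best_sse)
```

A single unlucky start can place two centroids in the same group; comparing runs makes that outcome easy to discard.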

Page 22: Rafael J. Fernández-Moctezuma April 3, 2007

22

May not be a small enough set for KNN

What if we could further reduce the search within a cluster?

Suppose a torus is defined in terms of distances to the cluster centroid. We have already computed each member’s distance to the centroid during pre-processing, and we computed the novelty point’s distance d to the centroid when classifying it.

Only do KNN with points whose distance to the centroid lies between the two radii (d – λ and d + λ).
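A sketch of the torus filter, assuming the members' distances to the centroid were stored during pre-processing. The names and sample values are hypothetical.

```python
import math

centroid = (0.0, 0.0)
members = [(1.0, 0.0), (0.0, 2.0), (3.0, 0.0), (0.0, 4.0), (5.0, 0.0)]
# Pre-computed once, when the clusters are built.
member_distances = [math.dist(p, centroid) for p in members]


def shell_candidates(members, member_distances, d_query, lam):
    """Keep members whose distance to the centroid is in [d - lam, d + lam].

    Only stored scalars are compared; no new vector distances are needed.
    """
    low, high = d_query - lam, d_query + lam
    return [p for p, d in zip(members, member_distances) if low <= d <= high]


query = (0.0, 3.2)
d_query = math.dist(query, centroid)  # already known from classification
candidates = shell_candidates(members, member_distances, d_query, lam=1.0)
print(candidates)
```

By the triangle inequality, any member within λ of the new point has a centroid distance within λ of d, so the filter never discards a neighbor closer than λ. It can keep points on the far side of the centroid, which the later KNN step sorts out.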

Page 23: Rafael J. Fernández-Moctezuma April 3, 2007

23

Back to transportation

• Input vector

$$v(t, s) = \left(\text{speed}^{(1)}, \text{volume}^{(1)}, \text{occupancy}^{(1)}, \ldots, \text{speed}^{(n)}, \text{volume}^{(n)}, \text{occupancy}^{(n)}, \text{traveltime}\right)$$

for a particular time t, in a segment s that contains n sensor stations

• What are we clustering?

– The input vectors, looking only at the fundamental measurements from the n stations (3n values).

– Travel time is our measure of interest, i.e., target for prediction. Clustering is “blind” to it.
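A sketch of how such a vector could be assembled and then split, so that clustering sees only the station measurements. The function names and readings are hypothetical.

```python
def make_input_vector(station_readings, travel_time):
    """Flatten per-station (speed, volume, occupancy) and append travel time."""
    vector = []
    for speed, volume, occupancy in station_readings:
        vector.extend([speed, volume, occupancy])
    vector.append(travel_time)
    return vector


def split_features_target(vector):
    """Clustering uses the features; travel time is kept aside as the target."""
    return vector[:-1], vector[-1]


# Two hypothetical stations at one time step, travel time in minutes.
v = make_input_vector([(55.0, 20, 0.10), (30.0, 25, 0.30)], travel_time=7.5)
features, target = split_features_target(v)
print(features, target)
```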

Page 24: Rafael J. Fernández-Moctezuma April 3, 2007

24

Considerations

• I said that clustering is “blind” to the a posteriori estimate of travel time

• This much is true, but we want clusters that help us predict travel time.

• The core assumption is that the fundamental measurements at a time t are related to the travel time (they are, we just haven’t expressed that yet).

Page 25: Rafael J. Fernández-Moctezuma April 3, 2007

25

Considerations

• Look at the difference in travel times among members of a cluster. This helps us choose a suitable number of clusters.

• We could use variance, but (1) it grows quadratically, and (2) is not intuitive.

• Proposed error function should decrease smoothly as the number of clusters increases.

• The error function is saying “+/- 3 is all the same to me.”

For a cluster j with N elements, where x is travel time:

$$\text{error}^{(j)} = \frac{1}{N} \sum_{i=1}^{N} \left| x_i^{(j)} - \bar{x}^{(j)} \right|$$

Average over clusters for each choice of K:

$$\text{Error}(K) = \frac{1}{K} \sum_{j=1}^{K} \text{error}^{(j)}$$
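A sketch of one error function of this kind: the mean absolute deviation of travel times within each cluster, averaged over clusters. Measuring the deviation from the cluster mean is an assumption here, as are the names and numbers. Clusters with no members are simply skipped.

```python
from collections import defaultdict


def cluster_error(travel_times):
    """Mean absolute deviation of travel times from their mean."""
    mean = sum(travel_times) / len(travel_times)
    return sum(abs(x - mean) for x in travel_times) / len(travel_times)


def average_error(travel_times, labels):
    """Average the per-cluster errors over the clusters that have members."""
    groups = defaultdict(list)
    for x, lab in zip(travel_times, labels):
        groups[lab].append(x)
    return sum(cluster_error(g) for g in groups.values()) / len(groups)


# Hypothetical travel times (minutes) and cluster labels for K = 2.
times = [4.0, 6.0, 10.0, 14.0, 18.0]
labels = [0, 0, 1, 1, 1]
print(average_error(times, labels))
```

Unlike variance, this stays in the same units as travel time (minutes), which makes a curve of error against K easier to read.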

Page 26: Rafael J. Fernández-Moctezuma April 3, 2007

26

Prototype implementation

• Looked at US 26 E sub-segment

• Morning period that includes peak: 06:00 – 11:00

• Treated one day as historical, tested on another day (Oct. 9 2006 and Oct. 12 2006)

• Careful: if we just shuffle points to estimate performance, we are fooling ourselves – the fitting process may have seen past and future. For this domain, if we are to simulate data loss, always leave the test day(s) out.
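A sketch of a split that holds out whole days instead of shuffling individual points. The record layout (day, features, travel time) and the values are hypothetical.

```python
def split_by_day(records, test_days):
    """Hold out every record from the test days; fit on the rest."""
    fit = [r for r in records if r[0] not in test_days]
    held_out = [r for r in records if r[0] in test_days]
    return fit, held_out


# Hypothetical records: (day, [speed, volume, occupancy], travel time in minutes).
records = [
    ("2006-10-09", [55.0, 20, 0.10], 6.0),
    ("2006-10-09", [30.0, 25, 0.30], 11.0),
    ("2006-10-12", [50.0, 22, 0.12], 6.5),
]
fit_set, test_set = split_by_day(records, test_days={"2006-10-12"})
print(len(fit_set), len(test_set))
```

No measurement from a test day can leak into fitting this way, so the fitting process never sees the neighbors in time of a test point.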

Page 27: Rafael J. Fernández-Moctezuma April 3, 2007

27

US 26 E

Image from http://maps.google.com/

Looked at three stations: Cornell, Murray, Cedar Hills

Hypothetical VMS between 185th and Cornell, with target destination between Cedar Hills and Parkway.

Segment length: 3.7 miles.

Believe it or not, it can take up to 30 minutes during rush hour (been there, done that).

Page 28: Rafael J. Fernández-Moctezuma April 3, 2007

28

Choice of k

• As expected, error function drops

[Plot: “Choice of k”. Error (0 to 12, ×10^4) vs. number of clusters (0 to 16).]

Page 29: Rafael J. Fernández-Moctezuma April 3, 2007

29

Choice of k

• As expected, error function drops

[Plot: “Choice of k”. Error (100 to 600) vs. number of clusters (2 to 10), with a region marked “suitable”.]

Page 30: Rafael J. Fernández-Moctezuma April 3, 2007

30

Simplifying criteria

• Radii determining the torus centered around the centroid:

R1 = d/2

R2 = 3d/2

where d is the distance from the novelty to the centroid (i.e., λ = d/2).

• “Fusion mechanism” is the average of the 3 nearest neighbors within the torus.
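A sketch that puts these criteria together for one cluster: keep members between d/2 and 3d/2 from the centroid, take the 3 nearest to the new point, and average their travel times. The names and data are hypothetical, and distances to the centroid are recomputed inline for brevity rather than looked up from pre-processing.

```python
import math


def estimate_travel_time(query, centroid, members, k=3):
    """members: (features, travel_time) pairs of the cluster nearest to query.

    Returns the mean travel time of the k nearest members inside the torus,
    or None if the torus is empty.
    """
    d = math.dist(query, centroid)
    r1, r2 = d / 2, 3 * d / 2
    in_torus = [(f, tt) for f, tt in members
                if r1 <= math.dist(f, centroid) <= r2]
    nearest = sorted(in_torus, key=lambda m: math.dist(m[0], query))[:k]
    if not nearest:
        return None
    return sum(tt for _, tt in nearest) / len(nearest)


centroid = (0.0, 0.0)
members = [((3.0, 0.0), 10.0), ((5.0, 0.0), 12.0), ((0.0, 4.0), 20.0),
           ((1.0, 0.0), 30.0), ((4.0, 1.0), 14.0), ((7.0, 0.0), 40.0)]
estimate = estimate_travel_time((4.0, 0.0), centroid, members)
print(estimate)
```

In this made-up example d = 4, so members at distance 1 and 7 from the centroid are dropped before any neighbor comparison, and the estimate is the mean of 10, 12, and 14 minutes.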

Page 31: Rafael J. Fernández-Moctezuma April 3, 2007

31

Experimental results

The trends during the peak period are followed correctly.

Unsurprisingly, the early peak is only somewhat captured – the fitting set did not have one. Still, ups and downs are discovered.

[Plot: “Measured vs. Estimated travel times”. Travel time (in minutes, 2 to 18) vs. time (06:00 to 11:00). Series: a posteriori estimate; on-line cluster-based estimate. ERRATA: labels reversed.]

Page 32: Rafael J. Fernández-Moctezuma April 3, 2007

32

Fitting set timeseries

[Plot: “A posteriori travel time estimates for 10/9/2006”. Travel time (minutes, 0 to 30) vs. time (6:00 to 11:00).]

Page 33: Rafael J. Fernández-Moctezuma April 3, 2007

33

Ongoing work (suggested by Kristin)

• Looking at probe runs and comparing the measured times with a posteriori estimates – previous efforts were made with instantaneous estimates only. We are curious whether the a posteriori estimate is significantly different from the instantaneous one.

• Pick a larger dataset – OR 217 has better sensor density. Test over one month or so.

Page 34: Rafael J. Fernández-Moctezuma April 3, 2007

34

Future work

• Current prototype is in MATLAB – should I start getting familiar with Niagara?

• Any better ideas for the “fusion”? Should this just be an extra parameter? (“pick k nearest neighbors”)

• The radius estimate can be a potential problem. Any suggestions? Should this just be one more parameter to find during fitting?

Page 35: Rafael J. Fernández-Moctezuma April 3, 2007

35

Future work

• Is a torus the right shape? How about a hypercone? Could it be easily derived on the fly from pre-computed information?

• We have considered expanding the feature vector (temperature, precipitation, etc.). These measurements are updated hourly, and are sometimes only available the next day. Any other sources that may make sense?