QUANNAN LI 1,2, YU ZHENG 2, XING XIE 2, YUKUN CHEN 2, WENYU LIU 1, WEI-YING MA 2 1 DEPT. ELECTRONICS...
Quannan Li 1,2, Yu Zheng 2, Xing Xie 2, Yukun Chen 2, Wenyu Liu 1, Wei-Ying Ma 2
1 Dept. of Electronics and Information Engineering, Huazhong University of Science and Technology
2 Microsoft Research Asia
The 16th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, 2008
Mining User Similarity Based on Location History
Presented on 26th Nov.
1. Introduction
2. Related work (skipped)
3. Architecture
4. User Similarity Exploration
5. Experiments
6. Conclusion
Outline
The pervasiveness of location-acquisition technologies such as GPS and GSM networks enables the collection of large spatio-temporal datasets and the discovery of valuable knowledge about movement behavior.
Existing applications use raw GPS data without much understanding of it.
Besides the GPS data itself, people want to know about user intention and user interests.
Projects [9][12][13][15] aiming to understand user-specific activity from individual GPS data have emerged, e.g., detecting the locations of a user and predicting the user's movement.
1. INTRODUCTION-1
However, the correlation between users, i.e., user similarity, has not been explored.
Applications:
Individuals: discovering potential friends who share similar interests in books, music and movies.
Merchants: improving their sales and marketing.
The first law of geography: everything is related to everything else, but near things are more related than distant things.
This paper mines user similarity based on user-generated GPS data and proposes a novel approach to measuring user similarity geographically.
1. INTRODUCTION-2
GPS log: a sequence of GPS points P = {p1, p2, …, pn}. Each GPS point contains a latitude, a longitude and a timestamp.
GPS trajectory: the GPS points connected in time order.
Stay point: two cases.
Stay point 1: the user remains stationary at P3 for longer than a time threshold, e.g., entering a building and losing the satellite signal for a time interval until coming back outdoors.
Stay point 2: the user wanders around within a spatial region covering several GPS points (P5, P6, P7 and P8), e.g., people traveling outdoors who are attracted by the surrounding environment.
3. ARCHITECTURE 3.1 Preliminary-1
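The two stay-point cases above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the names GpsPoint, StayPoint and detect_stay_points are hypothetical, distance is measured from the first point of a candidate group with the haversine formula, and the leaving time is approximated by the timestamp of the first point outside the region. The default thresholds are the 200 meters and 30 minutes quoted in the experiment settings.

```python
import math
from typing import List, NamedTuple


class GpsPoint(NamedTuple):
    lat: float  # degrees
    lon: float  # degrees
    t: float    # timestamp in seconds


class StayPoint(NamedTuple):
    lat: float
    lon: float
    arrive: float
    leave: float


def haversine_m(p: GpsPoint, q: GpsPoint) -> float:
    """Great-circle distance between two points in meters."""
    r = 6371000.0
    phi1, phi2 = math.radians(p.lat), math.radians(q.lat)
    dphi = phi2 - phi1
    dlmb = math.radians(q.lon - p.lon)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def detect_stay_points(points: List[GpsPoint],
                       dist_threh: float = 200.0,
                       time_threh: float = 30 * 60) -> List[StayPoint]:
    stays = []
    i, n = 0, len(points)
    while i < n:
        # Grow the group while points stay within dist_threh of points[i].
        j = i + 1
        while j < n and haversine_m(points[i], points[j]) <= dist_threh:
            j += 1
        # Assumption: time is measured up to the first point outside the
        # region (or the last logged point), so a signal gap also counts.
        end = j if j < n else n - 1
        if points[end].t - points[i].t >= time_threh:
            group = points[i:j]
            stays.append(StayPoint(
                lat=sum(p.lat for p in group) / len(group),
                lon=sum(p.lon for p in group) / len(group),
                arrive=points[i].t,
                leave=points[end].t,
            ))
            i = j
        else:
            i += 1
    return stays
```

With this sketch, a single point followed by a long gap (case 1) and a cluster of nearby points spanning more than the time threshold (case 2) both produce one stay point each.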
Location history: a record of the locations that an entity visited in geographical space over an interval of time.
3.1 Preliminary-2
Hierarchical graph: all users' stay points are put into one dataset and hierarchically clustered into several spatial regions in a divisive manner.
Similar stay points from various users are assigned to the same clusters on different layers.
On this shared hierarchy, each user can build a directed graph.
Pipeline: location history representation, individual hierarchical graph, user similarity exploration, friend and location recommendation.
3.2 Architecture of HGSM
HGSM: Hierarchical Graph-based Similarity Measurement.
The hierarchical graph is an effective representation of a user's location history and preserves the sequence property of user movement.
To measure the similarity between two users, on each layer we find the graph nodes the users share, then formulate a sequence based on these graph nodes.
Measuring the similarity between two users is thereby transformed into a sequence matching problem.
4. User Similarity Exploration 4.1 Location History Extraction
Figure 6 demonstrates how a sequence of places is extracted from each individual's location history. User 1 and user 2 share the same graph nodes A, B and C.
The green curve sequentially connects the blue nodes over these graph nodes in time order.
User 1: <C, A, B, B, C, C, B, C>, merged into <C(1), A(1), B(2), C(2), B(1), C(1)>
User 2: <A, B, C, A, A, C, A>, merged into <A(1), B(1), C(1), A(2), C(1), A(1)>
Each user's arrival time and leaving time on each cluster are given.
Figure 6
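The merging of consecutive visits to the same node shown for user 1 and user 2 amounts to run-length encoding. A minimal Python sketch, in which the function name merge_consecutive is hypothetical and arrival and leaving times are left out for brevity:

```python
from itertools import groupby
from typing import List, Tuple


def merge_consecutive(visits: List[str]) -> List[Tuple[str, int]]:
    """Collapse runs of the same node into (node, visit_count) pairs."""
    return [(node, sum(1 for _ in run)) for node, run in groupby(visits)]


user1 = merge_consecutive(["C", "A", "B", "B", "C", "C", "B", "C"])
user2 = merge_consecutive(["A", "B", "C", "A", "A", "C", "A"])
print(user1)  # [('C', 1), ('A', 1), ('B', 2), ('C', 2), ('B', 1), ('C', 1)]
print(user2)  # [('A', 1), ('B', 1), ('C', 1), ('A', 2), ('C', 1), ('A', 1)]
```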
Definitions related to similar sequences
Similar sequences: two sequences <a1, …, am> and <b1, …, bm> are similar if
1. ∀ 1 ≤ i ≤ m, ai = bi, i.e., the nodes at the same position of the two sequences share the same cluster ID;
2. ∀ 1 ≤ i < m, |Δti − Δti′| ≤ tth.
tth is a pre-defined time threshold, called the temporal constraint. It requires that the two users have similar transition times between the same regions.
4.2 Sequence Matching
Note: only the visited nodes and the arrival and leaving times are considered here; is the time of day (daytime vs. night) not taken into account?
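The two conditions for similar sequences can be checked directly. The sketch below is illustrative only: the Visit tuple layout and the function names are hypothetical, and the transition time Δt is assumed to run from leaving one node to arriving at the next, which the slide does not spell out.

```python
from typing import List, Tuple

# (cluster_id, arrival_hour, leaving_hour); this layout is an assumption.
Visit = Tuple[str, float, float]


def transition_times(seq: List[Visit]) -> List[float]:
    """Time from leaving node i to arriving at node i+1."""
    return [seq[i + 1][1] - seq[i][2] for i in range(len(seq) - 1)]


def is_similar(a: List[Visit], b: List[Visit], tth: float) -> bool:
    if len(a) != len(b):
        return False
    # Condition 1: same cluster ID at every position.
    if any(x[0] != y[0] for x, y in zip(a, b)):
        return False
    # Condition 2: transition times differ by at most tth.
    return all(abs(da - db) <= tth
               for da, db in zip(transition_times(a), transition_times(b)))


seq_a = [("A", 8, 9), ("B", 10, 12), ("C", 13, 14)]
seq_b = [("A", 9, 10), ("B", 12, 13), ("C", 16, 17)]
print(is_similar(seq_a, seq_b, tth=3))  # True
print(is_similar(seq_a, seq_b, tth=1))  # False
```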
m-length similar sequence: if the number of nodes in a similar sequence is m, we call it an m-length similar sequence.
Example: with the temporal constraint configured as 3 hours, a 3-length similar sequence <A(1)→B(2)→C(2)> is detected.
m-length similar sequence
A) We first detect the 1-length similar sequences: <A12>, <B23>, <B25>, <C31>, <C34> and <A42>. Here <A12> denotes that the first node of sequence 1 shares the same node A with the second node of sequence 2.
B) The extension operation builds on the results of the first step. If we set the temporal constraint tth to 2 hours, four 2-length similar sequences, <A12, B23>, <A12, C34>, <B23, C34> and <C31, A42>, can be retrieved.
C) Based on the 2-length sequences, one 3-length similar sequence, <A12, B23, C34>, can be detected.
Similar Sequence Matching-2
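Steps A) to C) can be traced with a short Python sketch. The node sequences A, B, C, A and C, A, B, C, B are inferred from the match labels above (<A12> means position 1 of sequence 1 and position 2 of sequence 2). The arrival and leaving hours in TIMES1 and TIMES2 are hypothetical values chosen for illustration, since the slide does not list them, and the function names are likewise assumptions.

```python
from typing import Dict, List, Tuple

Match = Tuple[int, int]            # 1-based positions in sequence 1 and 2
Times = Dict[int, Tuple[int, int]]  # position -> (arrival_hour, leaving_hour)


def one_length_matches(s1: str, s2: str) -> List[Match]:
    """Step A: every pair of positions that share the same node."""
    return [(i, j) for i, a in enumerate(s1, 1)
            for j, b in enumerate(s2, 1) if a == b]


def extend(seqs: List[List[Match]], matches: List[Match],
           t1: Times, t2: Times, tth: float) -> List[List[Match]]:
    """Steps B and C: append a later match if the transition times agree."""
    out = []
    for seq in seqs:
        i, j = seq[-1]
        for i2, j2 in matches:
            if i2 <= i or j2 <= j:
                continue  # must come later in both sequences
            dt1 = t1[i2][0] - t1[i][1]
            dt2 = t2[j2][0] - t2[j][1]
            if abs(dt1 - dt2) <= tth:
                out.append(seq + [(i2, j2)])
    return out


def label(seq: List[Match], s1: str) -> str:
    return "<" + ", ".join(f"{s1[i - 1]}{i}{j}" for i, j in seq) + ">"


S1, S2 = "ABCA", "CABCB"
TIMES1 = {1: (8, 9), 2: (10, 11), 3: (12, 13), 4: (14, 15)}               # hypothetical
TIMES2 = {1: (7, 8), 2: (9, 10), 3: (11, 12), 4: (13, 14), 5: (18, 19)}  # hypothetical

ones = one_length_matches(S1, S2)
twos = extend([[m] for m in ones], ones, TIMES1, TIMES2, tth=2)
threes = extend(twos, ones, TIMES1, TIMES2, tth=2)
print([label(s, S1) for s in twos])
print([label(s, S1) for s in threes])
```

With these illustrative times, <A12, B25> is rejected by the 2-hour constraint, leaving the four 2-length sequences and the single 3-length sequence listed in steps B) and C).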
When calculating the score, we take two factors into account: the length of the similar sequence and the layer on which the sequence is found.
Similarity measure of an m-length sequence (2): α(m) = 2^(m−1).
Similarity at a single layer (3): n is the number of similar sequences between the two users; s_i is the score of the i-th similar sequence, given by (2); N1 and N2 denote the numbers of stay points of the two users.
Similarity across multiple layers (4): H is the total number of layers of the hierarchical graph; β_l is the support of the similarity of sequences on the l-th layer.
The lower the layer on which a sequence is detected, the higher the score it obtains. In our experiment, β_l = 2^(l−1).
4.3 Similarity Measurement
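The scoring scheme can be sketched in Python as below. Only α(m) = 2^(m−1) and β_l = 2^(l−1) are stated in the slides; equations (3) and (4) did not survive in this transcript, so the normalization by N1 × N2 and the weighted sum over layers are assumptions suggested by the variable list, not a confirmed reproduction. All function names are hypothetical.

```python
from typing import Dict, List


def alpha(m: int) -> int:
    """Score of an m-length similar sequence, equation (2)."""
    return 2 ** (m - 1)


def beta(layer: int) -> int:
    """Weight of layer l; deeper (finer) layers weigh more."""
    return 2 ** (layer - 1)


def layer_similarity(lengths: List[int], n1: int, n2: int) -> float:
    # Assumption: sum of sequence scores divided by N1 * N2, so users
    # with many stay points are not favored automatically.
    return sum(alpha(m) for m in lengths) / (n1 * n2)


def overall_similarity(per_layer: Dict[int, float]) -> float:
    # Assumption: weighted sum of the single-layer similarities.
    return sum(beta(layer) * s for layer, s in per_layer.items())


# Three similar sequences of lengths 1, 2 and 3 between users with 10 stay points each.
print(layer_similarity([1, 2, 3], n1=10, n2=10))
print(overall_similarity({3: 0.5, 4: 0.25}))
```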
Data: 65 volunteers with GPS traces collected over 6 months. The total distance of the logs exceeds 50,000 km.
Stay point detection: timeThreh is set to 30 minutes and distThreh to 200 meters.
Clustering: OPTICS, a density-based clustering algorithm. A cluster is not divided further when one of the following conditions holds:
1) the number of users in it is less than two; 2) its bounding rectangle is smaller than 500 meters.
We establish 4-layer hierarchical clusters: the top layer is layer 1 (higher layer) and the bottom layer is layer 4 (lower layer).
5. Experiments 5.1 Settings-1
Sequence matching: we set tth of layer l to (H−l+1)∙T, where tth is the time threshold and H is the depth of the hierarchy.
With H = 4: for l = 4, tth = T; for l = 1, tth = 4T.
After trying a set of values for T, we found that the performance of HGSM does not change once T increases beyond a certain value.
Similarity measurement: we set α(m) = 2^(m−1) and β_l = 2^(l−1).
α(m) increases exponentially with the length of the sequence (m).
The significance of similar sequences found on layer l increases exponentially with l.
5.1 Settings-2
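The per-layer time threshold rule can be written out as a small Python helper. The function name layer_time_threshold and the choice of hours as the unit for T are assumptions made for illustration.

```python
def layer_time_threshold(layer: int, depth: int, base_t: float) -> float:
    """tth for a layer: (H - l + 1) * T, so coarser layers tolerate more."""
    if not 1 <= layer <= depth:
        raise ValueError("layer must be between 1 and depth")
    return (depth - layer + 1) * base_t


# With H = 4 and T = 1 hour, layers 1..4 get thresholds 4T, 3T, 2T and T.
print([layer_time_threshold(l, depth=4, base_t=1.0) for l in range(1, 5)])
```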
Ground truth: each volunteer is required to rate the other users based on individual understanding.
The relevance rating between two users is asymmetric, i.e., although user A rates user B as 2, user B may not rate user A as 2.
5.2 Evaluation Approach-1
For instance, using user Ui as a query, we retrieve the top ten similar users based on their similarity scores to Ui.
Then, a relevance vector G of the search results is formulated based on the relationship matrix.
We calculate MAP and nDCG for this retrieval.
After all the volunteers have been tested, we calculate the mean values of MAP and nDCG over the individual results.
5.2 Evaluation Approach-2
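The retrieval step for one query can be sketched as follows. The data layout is hypothetical: scores maps each candidate user to a similarity score with the query, and ratings[a][b] holds user a's rating of user b (asymmetric, as in the ground truth). It is also an assumption that the query user's own ratings fill the relevance vector and that unrated users count as 0.

```python
from typing import Dict, List


def relevance_vector(query: str,
                     scores: Dict[str, float],
                     ratings: Dict[str, Dict[str, int]],
                     k: int = 10) -> List[int]:
    """Ratings given by the query user to the k most similar users, in rank order."""
    ranked = sorted((u for u in scores if u != query),
                    key=lambda u: scores[u], reverse=True)[:k]
    given = ratings.get(query, {})
    return [given.get(u, 0) for u in ranked]


scores_to_a = {"b": 0.9, "c": 0.5, "d": 0.7}
ratings = {"a": {"b": 4, "c": 1}, "b": {"a": 2}}
print(relevance_vector("a", scores_to_a, ratings))  # [4, 0, 1]
```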
Evaluation framework: each of the 65 people is used in turn as a query to search for his/her top ten similar users.
Evaluation criteria: mean average precision (MAP) and normalized discounted cumulative gain (nDCG) are employed to evaluate the performance of our approach.
MAP: the mean of the precision scores. A user is deemed a relevant user if his/her relevance level is greater than or equal to 3.
Example: in the relevance vector G = <4,0,2,3,3,1,0,2,1,1>, the relevant users are those at ranks 1, 4 and 5.
5.2 Evaluation Approach-3
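The precision computation for the example vector G can be worked through in Python. This sketch uses the common definition of average precision (the mean of precision values taken at each relevant rank); the slide's own formula is not preserved in this transcript, so treat the exact normalization as an assumption. Function names are hypothetical.

```python
from typing import List


def average_precision(g: List[int], min_relevant: int = 3) -> float:
    """Mean of precision@rank over the ranks holding a relevant user."""
    hits = 0
    precisions = []
    for rank, rating in enumerate(g, 1):
        if rating >= min_relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(vectors: List[List[int]]) -> float:
    """MAP over the relevance vectors of all query users."""
    return sum(average_precision(g) for g in vectors) / len(vectors)


G = [4, 0, 2, 3, 3, 1, 0, 2, 1, 1]
# Relevant ranks are 1, 4 and 5, giving (1/1 + 2/4 + 3/5) / 3.
print(round(average_precision(G), 3))  # 0.7
```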
nDCG: measures the relative-to-the-ideal performance of information retrieval techniques [8].
The discounted cumulative gain (DCG) of G is computed with a logarithmic discount of base b (in our experiments, b = 2).
Given the ideal discounted cumulative gain DCG′, nDCG at the i-th position is computed as nDCG[i] = DCG[i] / DCG′[i].
5.2 Evaluation Approach-4
[8] Järvelin, K., Kekäläinen, J. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems, ACM Press (2002), 422-446.
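A minimal Python sketch of DCG and nDCG in the style of [8]: gains at ranks below b are not discounted, and from rank b onward each gain is divided by log_b(rank). Taking the ideal vector to be the same ratings sorted in descending order is an assumption; the slide does not say how DCG′ is built. Function names are hypothetical.

```python
import math
from typing import List


def dcg(g: List[int], b: int = 2) -> List[float]:
    """Cumulative DCG value at every rank position."""
    out, total = [], 0.0
    for rank, gain in enumerate(g, 1):
        total += gain if rank < b else gain / math.log(rank, b)
        out.append(total)
    return out


def ndcg(g: List[int], b: int = 2) -> List[float]:
    """nDCG[i] = DCG[i] / DCG'[i], with DCG' from the descending-sorted gains."""
    ideal = dcg(sorted(g, reverse=True), b)
    return [d / i if i > 0 else 0.0 for d, i in zip(dcg(g, b), ideal)]


G = [4, 0, 2, 3, 3, 1, 0, 2, 1, 1]
scores = ndcg(G)
print(round(scores[4], 4))  # nDCG@5
print(round(scores[9], 4))  # nDCG@10
```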
Baselines: if in cluster ci User1 has ki stay points and User2 has li stay points, the location histories of User1 and User2 can be represented as
u1 = <k1, k2, …, ki, …, kN> and u2 = <l1, l2, …, li, …, lN>.
The similarity of two users by count is computed by equation (6).
Cosine similarity and Pearson similarity are computed by equations (7) and (8), respectively.
5.2 Evaluation Approach-5
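The cosine and Pearson baselines over the stay-point count vectors u1 and u2 can be sketched with their textbook definitions, as below. Equations (6) to (8) are not preserved in this transcript, so the similarity-by-count baseline is deliberately left out rather than guessed, and the function names here are hypothetical.

```python
import math
from typing import List


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine of the angle between two count vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def pearson(u: List[float], v: List[float]) -> float:
    """Pearson correlation, i.e., cosine similarity of the mean-centered vectors."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return cosine([a - mu for a in u], [b - mv for b in v])


# Stay-point counts of two users over the same four clusters (made-up numbers).
u1 = [3, 0, 2, 5]
u2 = [1, 0, 1, 4]
print(round(cosine(u1, u2), 4), round(pearson(u1, u2), 4))
```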
Seq: the similarity based on the sequence feature.
Hier: the hierarchical property of geographic spaces.
Hier+Seq: HGSM, the similarity considering both the sequence and hierarchy properties.
Count: similarity-by-count on the bottom layer.
Hier+Count: similarity-by-count across multiple layers.
Cosine and Pearson: the cosine similarity and Pearson similarity on the bottom layer, respectively.
Hier+Cosine and Hier+Pearson: the cosine similarity and Pearson similarity across multiple layers, respectively.
5.3 Experimental Results
HGSM shows advantages over cosine similarity, Pearson similarity and similarity-by-count.
By considering the similarity across multiple layers, HGSM (Hier+Seq) leads the performance in both nDCG@5 and nDCG@10 among these methods.
The hierarchical property of geo-space further improves the performance of Seq.
MAP & nDCG
nDCG@5 changing over maxLength: when maxLength exceeds 5, the performance of the ranking no longer varies.
maxLength
The MAP and nDCG@5 of our approach changing over the time threshold tth:
the performance of our approach improves as tth increases.
When the time threshold increases to a certain value, the performance reaches its peak and no longer varies.
Time Threshold
Both MAP and nDCG increase as the layer level increases, i.e., layer 4 is more capable of discriminating similar users than layer 3.
MAP & nDCG changing on different layers
People's location histories imply their interests and preferences.
The HGSM framework enables us to consistently model each individual's location history and effectively measure the similarity among users.
It supports many applications, such as friend recommendation and location recommendation.
We explore users' location histories on different scales of geographic spaces.
The layer with relatively fine granularity enhances our capability of precisely discriminating similar users, while the layer with relatively coarse granularity enables us to recognize high-level user behavior and further recall less obvious similar users.
6. Conclusion