You Are Where You Tweet: A Content-Based Approach to Geo-locating Twitter … · 2017-03-09 · You...

32
Introduction Experimental Setting Method Experimental Results Conclusion You Are Where You Tweet: A Content-Based Approach to Geo-locating Twitter Users CIKM’10 Zhiyuan Cheng, James Caverlee and Kyumin Lee Texas A&M University Presented by Yi Zhu November 16, 2010 Yi Zhu (CUHK) Content-Based Approach to Geo-locating Twitter Users 1 / 32

Transcript of You Are Where You Tweet: A Content-Based Approach to Geo-locating Twitter … · 2017-03-09 · You...

Page 1: You Are Where You Tweet: A Content-Based Approach to Geo-locating Twitter … · 2017-03-09 · You Are Where You Tweet: A Content-Based Approach to Geo-locating Twitter Users CIKM’10

Introduction Experimental Setting Method Experimental Results Conclusion

You Are Where You Tweet:A Content-Based Approach to Geo-locating

Twitter Users

CIKM’10Zhiyuan Cheng, James Caverlee and Kyumin Lee

Texas A&M University

Presented by Yi Zhu

November 16, 2010

Yi Zhu (CUHK) Content-Based Approach to Geo-locating Twitter Users 1 / 32

Page 2: You Are Where You Tweet: A Content-Based Approach to Geo-locating Twitter … · 2017-03-09 · You Are Where You Tweet: A Content-Based Approach to Geo-locating Twitter Users CIKM’10

Introduction Experimental Setting Method Experimental Results Conclusion

Outline

1 IntroductionMotivationProblem

2 Experimental SettingDatasetEvaluation Metrics

3 MethodBaselineIdentifying Local WordsTweet Sparsity

4 Experimental Results

5 Conclusion

Yi Zhu (CUHK) Content-Based Approach to Geo-locating Twitter Users 2 / 32

Page 3: You Are Where You Tweet: A Content-Based Approach to Geo-locating Twitter … · 2017-03-09 · You Are Where You Tweet: A Content-Based Approach to Geo-locating Twitter Users CIKM’10

Introduction Experimental Setting Method Experimental Results Conclusion

Examples

Mongkok

Hong Kong

Object: Locating a Twitter user based on the contentof tweets.

Yi Zhu (CUHK) Content-Based Approach to Geo-locating Twitter Users 3 / 32

Page 4: You Are Where You Tweet: A Content-Based Approach to Geo-locating Twitter … · 2017-03-09 · You Are Where You Tweet: A Content-Based Approach to Geo-locating Twitter Users CIKM’10

Introduction Experimental Setting Method Experimental Results Conclusion

Motivation

Motivation

Location sparsity problem of Twitter26% users have listed a user location as granular as acity name.Twitter begin to support per-tweet geo-tagging sinceAugust 2009. However, fewer than 0.42% tweets aretagged.

Yi Zhu (CUHK) Content-Based Approach to Geo-locating Twitter Users 4 / 32

Page 5: You Are Where You Tweet: A Content-Based Approach to Geo-locating Twitter … · 2017-03-09 · You Are Where You Tweet: A Content-Based Approach to Geo-locating Twitter Users CIKM’10

Introduction Experimental Setting Method Experimental Results Conclusion

Motivation

Motivation

Personalized information servicesLocal news providingRegional advertisementsLocation-based application (earthquake detection)

Avoid the need for sensitive data (private userinformation, IP address)

Yi Zhu (CUHK) Content-Based Approach to Geo-locating Twitter Users 5 / 32

Page 6: You Are Where You Tweet: A Content-Based Approach to Geo-locating Twitter … · 2017-03-09 · You Are Where You Tweet: A Content-Based Approach to Geo-locating Twitter Users CIKM’10

Introduction Experimental Setting Method Experimental Results Conclusion

Problem

Challenges

Tweets status updates are noisy. Mixing a variety ofdaily interests.

Twitter users often rely on shorthand andnon-standard vocabulary for informalcommunication.

A user may span multiple locations beyond theirimmediate home location.

A user may have more than one associatedlocations.

Yi Zhu (CUHK) Content-Based Approach to Geo-locating Twitter Users 6 / 32

Page 7: You Are Where You Tweet: A Content-Based Approach to Geo-locating Twitter … · 2017-03-09 · You Are Where You Tweet: A Content-Based Approach to Geo-locating Twitter Users CIKM’10

Introduction Experimental Setting Method Experimental Results Conclusion

Problem

Problem Defined

Given tweets of Twitter users, our goal is to estimatethe city-level location of a user based purely on thecontent of their tweets.

Yi Zhu (CUHK) Content-Based Approach to Geo-locating Twitter Users 7 / 32

Page 8: You Are Where You Tweet: A Content-Based Approach to Geo-locating Twitter … · 2017-03-09 · You Are Where You Tweet: A Content-Based Approach to Geo-locating Twitter Users CIKM’10

Introduction Experimental Setting Method Experimental Results Conclusion

Problem

Problem Defined

Formally, the location estimation problem is definedas follows:

Given a set of tweets Stweets(u) posted by user u;Estimate a user’s probability of being located in city i :p(i |Stweets(u)), such that the city with maximumprobability lest(u) is the user’s actual location lact(u).

Yi Zhu (CUHK) Content-Based Approach to Geo-locating Twitter Users 8 / 32

Page 9: You Are Where You Tweet: A Content-Based Approach to Geo-locating Twitter … · 2017-03-09 · You Are Where You Tweet: A Content-Based Approach to Geo-locating Twitter Users CIKM’10

Introduction Experimental Setting Method Experimental Results Conclusion

Dataset

Data Crawling

API: twitter4j (open-source library for java).

Two crawling strategies:Crawling through Twitter’s public timeline API. (ActiveTwitter Users)Crawling by breadth-first search through social edges tocrawl each user’s friends. (Sub Social Graph of Twitter)

Yi Zhu (CUHK) Content-Based Approach to Geo-locating Twitter Users 9 / 32

Page 10: You Are Where You Tweet: A Content-Based Approach to Geo-locating Twitter … · 2017-03-09 · You Are Where You Tweet: A Content-Based Approach to Geo-locating Twitter Users CIKM’10

Introduction Experimental Setting Method Experimental Results Conclusion

Dataset

Dataset Description

From Sep 2009 to Jan 2010

Users: 1,074,375

Tweets: 29,479,600

75.05% users list location, but overly general(California) or nonsensical (Wonderland).

21% users list a location as granular as a city name.

5% users list latitude/longitude coordinate.

Yi Zhu (CUHK) Content-Based Approach to Geo-locating Twitter Users 10 / 32

Page 11: You Are Where You Tweet: A Content-Based Approach to Geo-locating Twitter … · 2017-03-09 · You Are Where You Tweet: A Content-Based Approach to Geo-locating Twitter Users CIKM’10

Introduction Experimental Setting Method Experimental Results Conclusion

Dataset

Dataset Filter

Filter all listed locations that have a valid city-levellabel.

Users: 130,689

Tweets: 4,124,960

Test Set:Extract users with 1000+ tweets and latitude/longitudecoordinates. (Generated by smartphone)Users: 5,190Tweets: more than 5 million

Yi Zhu (CUHK) Content-Based Approach to Geo-locating Twitter Users 11 / 32

Page 12: You Are Where You Tweet: A Content-Based Approach to Geo-locating Twitter … · 2017-03-09 · You Are Where You Tweet: A Content-Based Approach to Geo-locating Twitter Users CIKM’10

Introduction Experimental Setting Method Experimental Results Conclusion

Evaluation Metrics

Evaluation Metrics

Error Distance for user uErrDist(u) = d(lact(u), lest(u))

Average Error Distance for all users U:

AvgErrDist(U) =∑

u∈U ErrDist(u)|U|

Accuracy:

Accuracy(U) = |{u|u∈U ∧ ErrDist(u)6100}||U|

Yi Zhu (CUHK) Content-Based Approach to Geo-locating Twitter Users 12 / 32

Page 13: You Are Where You Tweet: A Content-Based Approach to Geo-locating Twitter … · 2017-03-09 · You Are Where You Tweet: A Content-Based Approach to Geo-locating Twitter Users CIKM’10

Introduction Experimental Setting Method Experimental Results Conclusion

Baseline

Baseline Location Estimation

p(i |Swords(u)) =∑

w∈Swords(u)p(i |w)× p(w).

Swords(u) is the set of words extracted from user u.

p(w) is the probability of the word w in the wholedataset, p(w) = count(w)

t

p(i |w) the likelihood that each word w is issued by auser located in city i .

Yi Zhu (CUHK) Content-Based Approach to Geo-locating Twitter Users 13 / 32

Page 14: You Are Where You Tweet: A Content-Based Approach to Geo-locating Twitter … · 2017-03-09 · You Are Where You Tweet: A Content-Based Approach to Geo-locating Twitter Users CIKM’10

Introduction Experimental Setting Method Experimental Results Conclusion

Baseline

Baseline Location Estimation Result

Accuracy: 10.12%

AvgErrDist: 1773 miles

Problem:Local Words: isolate the words which can distinguishlocation of the user.Tweet Sparsity: location sparsity of words in tweets.

Yi Zhu (CUHK) Content-Based Approach to Geo-locating Twitter Users 14 / 32

Page 15: You Are Where You Tweet: A Content-Based Approach to Geo-locating Twitter … · 2017-03-09 · You Are Where You Tweet: A Content-Based Approach to Geo-locating Twitter Users CIKM’10

Introduction Experimental Setting Method Experimental Results Conclusion

Identifying Local Words

Spatial variation model

Given a word, decide if it is local or non-local.

Spatial variation model (Backstrom et al., WWW’08)Analysis of geographic distribution of terms in searchengine query logs.Cd−α is the approximately probability of the queryissued from a place with a distance d from the center.C is a constant to specify the frequency of the center.α control the speed of the frequency falls.

Yi Zhu (CUHK) Content-Based Approach to Geo-locating Twitter Users 15 / 32

Page 16: You Are Where You Tweet: A Content-Based Approach to Geo-locating Twitter … · 2017-03-09 · You Are Where You Tweet: A Content-Based Approach to Geo-locating Twitter Users CIKM’10

Introduction Experimental Setting Method Experimental Results Conclusion

Identifying Local Words

Identifying Local Words in Tweets

C and α can be used to determine if the word islocal.

For a word w , given a center and the centralfrequency is C, compute the maximum-likelihoodvalue.

For each city i , users from i tweet word w n times:n > 0, then multiply the overall probability by (Cd−αi )n.n = 0, then multiply the overall probability by 1− Cd−αi .di is the distance between city i and the center of wordw .

Yi Zhu (CUHK) Content-Based Approach to Geo-locating Twitter Users 16 / 32

Page 17: You Are Where You Tweet: A Content-Based Approach to Geo-locating Twitter … · 2017-03-09 · You Are Where You Tweet: A Content-Based Approach to Geo-locating Twitter Users CIKM’10

Introduction Experimental Setting Method Experimental Results Conclusion

Identifying Local Words

Identifying Local Words in Tweets

To avoid underflow, logarithms are added.

Suppose S is the set of occurrences for word w ,then:

f (C, α) =∑i∈S

log Cdi−α +

∑i /∈S

log(1− Cdi−α)

It has exactly one local maximum (unimodal)LatticesGolden section search

Yi Zhu (CUHK) Content-Based Approach to Geo-locating Twitter Users 17 / 32

Page 18: You Are Where You Tweet: A Content-Based Approach to Geo-locating Twitter … · 2017-03-09 · You Are Where You Tweet: A Content-Based Approach to Geo-locating Twitter Users CIKM’10

Introduction Experimental Setting Method Experimental Results Conclusion

Identifying Local Words

Identifying Local Words in Tweets

Yi Zhu (CUHK) Content-Based Approach to Geo-locating Twitter Users 18 / 32

Page 19: You Are Where You Tweet: A Content-Based Approach to Geo-locating Twitter … · 2017-03-09 · You Are Where You Tweet: A Content-Based Approach to Geo-locating Twitter Users CIKM’10

Introduction Experimental Setting Method Experimental Results Conclusion

Identifying Local Words

Identifying Local Words in Tweets

Yi Zhu (CUHK) Content-Based Approach to Geo-locating Twitter Users 19 / 32

Page 20: You Are Where You Tweet: A Content-Based Approach to Geo-locating Twitter … · 2017-03-09 · You Are Where You Tweet: A Content-Based Approach to Geo-locating Twitter Users CIKM’10

Introduction Experimental Setting Method Experimental Results Conclusion

Identifying Local Words

Identifying Local Words in Tweets

Yi Zhu (CUHK) Content-Based Approach to Geo-locating Twitter Users 20 / 32

Page 21: You Are Where You Tweet: A Content-Based Approach to Geo-locating Twitter … · 2017-03-09 · You Are Where You Tweet: A Content-Based Approach to Geo-locating Twitter Users CIKM’10

Introduction Experimental Setting Method Experimental Results Conclusion

Tweet Sparsity

Laplace Smoothing (Add-One Smoothing)

p(i |w) = 1+count(w ,i)V+N(w)

,

count(w , i): term count of word w in city i ;

V : the size of vocabulary;

N(w): total count of w in all cities.

Yi Zhu (CUHK) Content-Based Approach to Geo-locating Twitter Users 21 / 32

Page 22: You Are Where You Tweet: A Content-Based Approach to Geo-locating Twitter … · 2017-03-09 · You Are Where You Tweet: A Content-Based Approach to Geo-locating Twitter Users CIKM’10

Introduction Experimental Setting Method Experimental Results Conclusion

Tweet Sparsity

State-Level Smoothing

State probability:

ps(s|w) =∑

i∈Sc p(i|w)

|Sc | ,

Sc : set of cities in the state s.

State-level smoothing:

p′(i |w) = λ× p(i |w) + (1− λ)× ps(s|w),i : a city in the state s;1− λ: amount of smoothing.

Yi Zhu (CUHK) Content-Based Approach to Geo-locating Twitter Users 22 / 32

Page 23: You Are Where You Tweet: A Content-Based Approach to Geo-locating Twitter … · 2017-03-09 · You Are Where You Tweet: A Content-Based Approach to Geo-locating Twitter Users CIKM’10

Introduction Experimental Setting Method Experimental Results Conclusion

Tweet Sparsity

Lattice-Based Neighborhood Smoothing

Per-lattice probability:

p(lat |w) =∑

i∈Scp(i |w),

lat : a lattice.Sc : set of cities in lat .

Lattice probability:

p′(lat |w) = µ∗p(lat |w)+(1− µ) ∗∑

lati∈Sneighbors

p(lati |w),

µ: parameter.neighbors: 8 lattice around lat .

Yi Zhu (CUHK) Content-Based Approach to Geo-locating Twitter Users 23 / 32

Page 24: You Are Where You Tweet: A Content-Based Approach to Geo-locating Twitter … · 2017-03-09 · You Are Where You Tweet: A Content-Based Approach to Geo-locating Twitter Users CIKM’10

Introduction Experimental Setting Method Experimental Results Conclusion

Tweet Sparsity

Lattice-Based Neighborhood Smoothing

Lattice-based neighborhood smoothing:

p′(i |w) = λ ∗ p(i |w) + (1− λ) ∗ p′(lat |w),i : a city in the lattice lat ;λ: smoothing parameter.

Yi Zhu (CUHK) Content-Based Approach to Geo-locating Twitter Users 24 / 32

Page 25: You Are Where You Tweet: A Content-Based Approach to Geo-locating Twitter … · 2017-03-09 · You Are Where You Tweet: A Content-Based Approach to Geo-locating Twitter Users CIKM’10

Introduction Experimental Setting Method Experimental Results Conclusion

Tweet Sparsity

Model-Based Smoothing

p′(i |w) = C(w)d−α(w)i ,

C(w), α(w): optimized parameters for word w .

Yi Zhu (CUHK) Content-Based Approach to Geo-locating Twitter Users 25 / 32

Page 26: You Are Where You Tweet: A Content-Based Approach to Geo-locating Twitter … · 2017-03-09 · You Are Where You Tweet: A Content-Based Approach to Geo-locating Twitter Users CIKM’10

Introduction Experimental Setting Method Experimental Results Conclusion

Tweet Sparsity

Smoothing Comparison

Geographic Range Parameters ComplexityLaplace None None Low

State-Level State λ HighNeighborhood Neighbor Lattices µ, λ HighestModel-Based Global None Lowest

Yi Zhu (CUHK) Content-Based Approach to Geo-locating Twitter Users 26 / 32

Page 27: You Are Where You Tweet: A Content-Based Approach to Geo-locating Twitter … · 2017-03-09 · You Are Where You Tweet: A Content-Based Approach to Geo-locating Twitter Users CIKM’10

Introduction Experimental Setting Method Experimental Results Conclusion

Model and Smoothing Comparison

Yi Zhu (CUHK) Content-Based Approach to Geo-locating Twitter Users 27 / 32

Page 28: You Are Where You Tweet: A Content-Based Approach to Geo-locating Twitter … · 2017-03-09 · You Are Where You Tweet: A Content-Based Approach to Geo-locating Twitter Users CIKM’10

Introduction Experimental Setting Method Experimental Results Conclusion

Model and Smoothing Comparison

Yi Zhu (CUHK) Content-Based Approach to Geo-locating Twitter Users 28 / 32

Page 29: You Are Where You Tweet: A Content-Based Approach to Geo-locating Twitter … · 2017-03-09 · You Are Where You Tweet: A Content-Based Approach to Geo-locating Twitter Users CIKM’10

Introduction Experimental Setting Method Experimental Results Conclusion

Capacity of Estimator

Yi Zhu (CUHK) Content-Based Approach to Geo-locating Twitter Users 29 / 32

Page 30: You Are Where You Tweet: A Content-Based Approach to Geo-locating Twitter … · 2017-03-09 · You Are Where You Tweet: A Content-Based Approach to Geo-locating Twitter Users CIKM’10

Introduction Experimental Setting Method Experimental Results Conclusion

Number of Tweets

Yi Zhu (CUHK) Content-Based Approach to Geo-locating Twitter Users 30 / 32

Page 31: You Are Where You Tweet: A Content-Based Approach to Geo-locating Twitter … · 2017-03-09 · You Are Where You Tweet: A Content-Based Approach to Geo-locating Twitter Users CIKM’10

Introduction Experimental Setting Method Experimental Results Conclusion

Conclusion

A probabilistic framework for estimating city-levellocation of Twitter users based on the content oftweets.

Local words identifying and some smoothing canimprove the estimation

100 tweets are enough for locating.

Yi Zhu (CUHK) Content-Based Approach to Geo-locating Twitter Users 31 / 32

Page 32: You Are Where You Tweet: A Content-Based Approach to Geo-locating Twitter … · 2017-03-09 · You Are Where You Tweet: A Content-Based Approach to Geo-locating Twitter Users CIKM’10

Introduction Experimental Setting Method Experimental Results Conclusion

Thanks!

Q & A

Yi Zhu (CUHK) Content-Based Approach to Geo-locating Twitter Users 32 / 32