Exploring Linkability of User Reviews

Mishari Almishari and Gene Tsudik
Computer Science Department, University of California, Irvine

malmisha,gts@ics.uci.edu

Increasing Popularity of Reviewing Sites

Yelp: more than 39M visitors and 15M reviews in 2010

[Figure labels: Category, Rating]

Rising Awareness of Privacy

How Does Privacy Apply to Reviews?

• Traceability
• Linkability of ad hoc reviews
• Linkability of several accounts

Contribution

Extensive study to measure privacy/linkability in user reviews

Propose models that adequately identify authors

Settings & Problem Formulation

IR: Identified Record

AR: Anonymous Record

[Figure: a set of identified records (IR) and a set of anonymous records (AR)]

Anonymous Record Size (AR)

Identified Record Size (IR)

Matching Model

Top-X Linkability (X: 1 and 10)

1, 5, 10, 20, …, 60
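Top-X linkability can be read as the fraction of anonymous records whose true author appears among the first X candidates returned by the matching model. A minimal sketch under that reading; the function name, the dictionary layout, and the toy data are illustrative assumptions, not taken from the slides:

```python
def top_x_linkability(rankings, true_authors, x):
    """Fraction of anonymous records whose true author is in the first x candidates.

    rankings:     dict mapping an AR id to its ranked list of candidate authors
    true_authors: dict mapping an AR id to the author who actually wrote it
    """
    hits = sum(
        1
        for ar_id, ranked in rankings.items()
        if true_authors[ar_id] in ranked[:x]
    )
    return hits / len(rankings)


# Hypothetical toy data: three anonymous records, all written by "alice".
rankings = {
    "ar1": ["alice", "bob", "carol"],
    "ar2": ["carol", "alice", "bob"],
    "ar3": ["bob", "carol", "alice"],
}
true_authors = {"ar1": "alice", "ar2": "alice", "ar3": "alice"}

top1 = top_x_linkability(rankings, true_authors, 1)  # only ar1 is linked
top2 = top_x_linkability(rankings, true_authors, 2)  # ar1 and ar2 are linked
```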

Dataset

• 1 million reviews
• 2,000 users with more than 300 reviews each

Methodology

• Naïve Bayesian model
• Kullback-Leibler model
  • Symmetric version

Naïve Bayesian (NB)

Identified Record (IR)

Anonymous Record (AR)

Decreasing Sorted List of IRs
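The slide's formula did not survive extraction. The following is a minimal sketch of how a Naïve Bayesian matcher could score an anonymous record against each identified record and return the candidates in decreasing order of score. The function names, the add-one smoothing (`alpha`), and the toy data are assumptions made for illustration, not details from the slides:

```python
import math
from collections import Counter


def nb_log_score(ar_tokens, ir_tokens, vocabulary, alpha=1.0):
    """Log-probability of the AR tokens under the IR's token distribution.

    alpha is an assumed smoothing constant that avoids log(0) for tokens
    never seen in the identified record.
    """
    ir_counts = Counter(t for t in ir_tokens if t in vocabulary)
    total = sum(ir_counts.values()) + alpha * len(vocabulary)
    return sum(
        math.log((ir_counts[t] + alpha) / total)
        for t in ar_tokens
        if t in vocabulary
    )


def rank_by_nb(ar_tokens, identified, vocabulary):
    """Return author ids sorted by decreasing NB score (best match first)."""
    scores = {
        author: nb_log_score(ar_tokens, tokens, vocabulary)
        for author, tokens in identified.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical toy data over a two-letter vocabulary.
vocabulary = {"a", "b"}
identified = {"alice": list("aaab"), "bob": list("abbb")}
ranking = rank_by_nb(list("aab"), identified, vocabulary)
```

Working in log space keeps the product of many small probabilities from underflowing, which matters once records contain thousands of tokens.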

Kullback-Leibler Divergence (KLD)

Identified Record (IR)

Anonymous Record (AR)

Increasing Sorted List of IRs
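A KLD-based matcher treats each record as a token distribution and ranks identified records by increasing divergence from the anonymous record. The sketch below assumes the symmetric version is the sum of the two directed divergences and uses add-one smoothing; both choices, along with the names and toy data, are assumptions rather than details given on the slides:

```python
import math
from collections import Counter


def distribution(tokens, vocabulary, alpha=1.0):
    """Smoothed token distribution over a fixed vocabulary (alpha is assumed)."""
    counts = Counter(t for t in tokens if t in vocabulary)
    total = sum(counts.values()) + alpha * len(vocabulary)
    return {t: (counts[t] + alpha) / total for t in vocabulary}


def kld(p, q):
    """Directed Kullback-Leibler divergence D(p || q)."""
    return sum(p[t] * math.log(p[t] / q[t]) for t in p)


def symmetric_kld(p, q):
    """Assumed symmetric form: D(p || q) + D(q || p)."""
    return kld(p, q) + kld(q, p)


def rank_by_kld(ar_tokens, identified, vocabulary):
    """Return author ids sorted by increasing divergence (best match first)."""
    ar_dist = distribution(ar_tokens, vocabulary)
    distances = {
        author: symmetric_kld(distribution(tokens, vocabulary), ar_dist)
        for author, tokens in identified.items()
    }
    return sorted(distances, key=distances.get)


# Hypothetical toy data over a two-letter vocabulary.
vocabulary = {"a", "b"}
identified = {"alice": list("aaab"), "bob": list("abbb")}
ranking = rank_by_kld(list("aab"), identified, vocabulary)
```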

Maximum Likelihood Estimation
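The formula on this slide was lost in extraction. The standard maximum-likelihood estimate of a token probability is its relative frequency in the record; the notation below ($c(t, IR)$ for the count of token $t$ in the identified record, $V$ for the token set) is assumed here, not taken from the slide:

```latex
\hat{P}(t \mid IR) = \frac{c(t, IR)}{\sum_{t' \in V} c(t', IR)}
```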

Tokens

• Unigram: ‘a’, …, ‘z’
• Digram: ‘aa’, ‘ab’, …, ‘zz’
• Rating: 1, 2, 3, 4, 5
• Category: Restaurant, Beauty and Spa, Education
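The lexical tokens can be extracted from review text with a few lines of code. This sketch assumes case is folded, non-letters are dropped, and digrams do not span spaces or punctuation; the slides do not specify these details, and the function names are illustrative:

```python
import string

LETTERS = set(string.ascii_lowercase)


def unigrams(text):
    """Single letters 'a'..'z' in order of appearance, case-folded."""
    return [ch for ch in text.lower() if ch in LETTERS]


def digrams(text):
    """Pairs of adjacent letters; pairs containing a non-letter are skipped."""
    text = text.lower()
    return [
        text[i:i + 2]
        for i in range(len(text) - 1)
        if text[i] in LETTERS and text[i + 1] in LETTERS
    ]


review = "Great food!"  # hypothetical review text
letters = unigrams(review)  # nine single-letter tokens
pairs = digrams(review)     # seven two-letter tokens
```

With 26 possible unigrams and 26 × 26 = 676 possible digrams, each record reduces to a small fixed-length count vector.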

Lexical Token Results

NB - Unigram

Size 60: LR 83% (Top-1), LR 96% (Top-10)

KLD - Unigram

Size 60: LR 83% (Top-1), LR 96% (Top-10)

NB - Digram

Size 20: LR 97% (Top-1)

Size 10: LR 88% (Top-1)

KLD - Digram

Size 60: LR 99% (Top-1)

Size 30: LR 75% (Top-1)

Improvement (1): Combining Lexical and Non-Lexical Tokens

Combining in the NB model is straightforward: P(Rating|IR), P(Category|IR)

But for KLD? Weighted average

First, combine rating and category (weight: 0.5)

Second, combine non-lexical and lexical (weight: 0.997/0.97 for unigram/digram)
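The two-step weighted average can be sketched as follows. The slides give the weight values but their formulas were lost, so which side of each average receives the weight is an assumption here, as are the function and parameter names:

```python
def weighted_average(d_first, d_second, weight):
    """Linear combination: weight * d_first + (1 - weight) * d_second."""
    return weight * d_first + (1.0 - weight) * d_second


def combined_distance(d_rating, d_category, d_lexical, alpha=0.5, beta=0.997):
    """Combine three per-token-type divergences into a single distance.

    Step 1 averages the rating and category divergences with weight alpha.
    Step 2 merges that non-lexical distance with the lexical one using beta.
    Assumption: beta is applied to the non-lexical side; the slides do not
    say which side it belongs to. Use beta=0.97 for digrams.
    """
    d_non_lexical = weighted_average(d_rating, d_category, alpha)
    return weighted_average(d_non_lexical, d_lexical, beta)


# Hypothetical divergence values for one (IR, AR) pair.
distance = combined_distance(d_rating=0.2, d_category=0.4, d_lexical=0.05)
```

A weight this close to 1 hints that the two distances live on very different scales, so the weight may be acting as a rescaling factor as much as a statement of relative importance.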

Token Combining Results

Rating, Category, and Unigram - NB

Gain: up to 20%

Size 30: 60% to 80%

Size 60: 83% to 96%

Rating, Category, and Unigram - KLD

Gain: up to 12%

Size 40: 68% to 80%

Size 60: 83% to 92%

Rating, Category, and Digram - NB

Rating, Category, and Digram - KLD

What about Restricting Identified Record (IR) Size?

Anonymous Record Size (AR)

Identified Record Size (IR)

Matching Model

Top-X Linkability (X: 1 and 10)

Restricted IR - NB

Affected by IR size

Restricted IR - KLD

Performed better for smaller IR

Size 20 or less, improved

The rest, comparable

What about Matching All AR’s at once?

Anonymous Record Size (AR)

Identified Record Size (IR)

Matching Model

Top-X Linkability (X: 1 and 10)

Anonymous Records (AR’s)

Identified Records (IR’s)

Matching Model

Improvement (2): Matching All IR’s at Once
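The slides do not show the algorithm for this step. One plausible reading, sketched here purely as an assumption, is a greedy one-to-one assignment: repeatedly take the closest remaining (AR, IR) pair so that no identified record is claimed by two anonymous records. The function name and toy distances are hypothetical:

```python
def match_all_greedy(distances):
    """Greedy one-to-one matching of anonymous records to identified records.

    distances: dict mapping (ar_id, ir_id) to a distance (smaller is closer).
    Returns a dict mapping each matched ar_id to exactly one ir_id.
    """
    assignment = {}
    used_irs = set()
    # Visit candidate pairs from closest to farthest.
    for (ar_id, ir_id), _ in sorted(distances.items(), key=lambda kv: kv[1]):
        if ar_id not in assignment and ir_id not in used_irs:
            assignment[ar_id] = ir_id
            used_irs.add(ir_id)
    return assignment


# Hypothetical distances: matched independently, both ARs would pick "alice".
distances = {
    ("ar1", "alice"): 0.1,
    ("ar1", "bob"): 0.5,
    ("ar2", "alice"): 0.2,
    ("ar2", "bob"): 0.3,
}
assignment = match_all_greedy(distances)
```

In the toy data, matching each AR on its own would send both to "alice"; the joint pass gives "alice" to the closer record and leaves "bob" for the other.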

MatchAll - Restricted

Gain: up to 16%

Size 30: from 74% to 90%

MatchAll - Full

Gain: up to 23%

Size 20: from 35% to 55%

Improvement (3): For Small IR Size

Changing it to: 0.5 + Review Length

Results – Improvement (3)

Size 10: 89% to 92%

Size 7: 79% to 84%

Gain: up to 5%

Discussion

• Implications
  • Cross-referencing
  • Review spam
• Non-prolific users
  • Gradually become prolific
  • IR of 20: linkability around 70%
• Anonymous record size
  • Linkability is high even for small ARs (92% for an AR of 10)
  • 60 is only 20% of the minimum user contribution

Discussion (cont.)

• Unigram tokens
  • Very comparable for larger ARs
  • Entail fewer resources in the attack: 26 vs. 676

Future Directions

• Improving more for small AR’s

• Other probabilistic models

• Using stylometry

• Exploring Linkability in other Preference Databases

• More than one AR for different Users: Exploring it more

Conclusion

Extensive study to assess linkability of user reviews
• For a large set of users
• Using very simple features

Users are very exposed, even with simple features and a large number of authors

Thank you all!