Data Collection without Privacy Side Effects

CLIQZ @ BIG 2016
Data Collection without Privacy Side-Effects
Konark Modi, Josep M. Pujol
@konarkmodi, @solso

Transcript of Data Collection without Privacy Side Effects

Page 1: Data Collection without Privacy Side Effects

CLIQZ @ BIG 2016…

Data Collection without Privacy Side-Effects

Konark Modi, Josep M. Pujol

@konarkmodi, @solso

Page 2: Data Collection without Privacy Side Effects

Data collection on Big Data

Where does the data of Big Data come from? The elephant in the room.

Applications of Big Data

Page 3: Data Collection without Privacy Side Effects

Who collects data on the Web?

Wired, eBay and Meetic collect data as 1st parties when a user visits or interacts with their sites. However, there are a lot of 3rd parties that also collect data.

From CLIQZ's paper "Tracking the Trackers", to be presented at WWW 2016 [1]: 78% of page loads send information to at least one 3rd party that is deemed unsafe with respect to privacy.

Page 4: Data Collection without Privacy Side Effects

Motivation: A recurring real-life conversation

Company X: Hi, this is company X. CLIQZ anti-tracking is affecting us. Can we talk?

CLIQZ: Sure.

Company X: We are not trackers. We only measure audiences (or collect aggregated data, or measure goal conversion, or site performance metrics). We take privacy very seriously.

CLIQZ: Understood, let us check what's going on.

Company X: Thanks a lot.

CLIQZ: Well, you are actually tracking users. See the attachment. You have the ability to know that these 20 webpages were visited by the same person, and, to make things worse, you can derive that person's real identity. Users' privacy is at risk.

Company X: No, no. We do NOT use that information at all, we remove it as soon as it is received. We are only interested in measuring XYZ.

CLIQZ: But we just showed you an example of tracking. Intentional or not, it should not matter, right?

Company X: I repeat that we are NOT using this data at all for anything, see our Privacy Policy. To implement our service we require that data element that can be used as a user identifier, there is no other way…

CLIQZ: There is another way. Happy to show you …

Page 5: Data Collection without Privacy Side Effects

Motivation: A recurring real-life conversation

CLIQZ: There is another way. Happy to show you …

Company X: ...mmm, thanks… ...er... ...we will get back to you...

CLIQZ: Great! Looking forward to it.

Unfortunately, they never come back. We formulated 3 hypotheses:

1) They were interested in collecting data from users. They are "intentionally" tracking.
2) They are not concerned about privacy side-effects. On the trade-off between privacy and convenience, they choose the latter.
3) We could not successfully explain our alternative approach for privacy-preserving data collection.

Page 6: Data Collection without Privacy Side Effects

Motivation: A recurring real-life conversation

We hope that it is not #1. That is why we decided:

•  To open-source a prototype of a Google Analytics look-alike that does not rely on tracking, hoping that the code will be more explanatory.

•  To write this paper and presentation.

Page 7: Data Collection without Privacy Side Effects

An Example of Unintentional Tracking

Google Analytics (GA)

•  GA is massive, present in 44% of all page loads.
•  GA does not offer any (public) service that requires building a session with all of a user's activity.
•  GA actually cares a lot about privacy:
–  Ephemeral UIDs
–  Sanitization of URLs

Page 13: Data Collection without Privacy Side Effects

Privacy Breaches are Unavoidable (even for GA)

wired.com/ 09:49:12 [137.9.10.X, 1140x645]
ebay-kleinanzeigen.de/s-muenchen/cyclocross/k0l6411r200 09:50:02 [137.9.10.X, 1140x645]
twitter.com/solso 09:52:10 [137.9.10.X, 1140x645]
www.meetic.com/home/index.php 09:59:01 [137.9.10.X, 1140x645]
analytics.twitter.com/user/solso/home 10:05:45 [137.9.10.X, 1140x645]

The last page is only accessible after login and it contains my username => Personally Identifiable Information (PII) leak.

IP: 137.9.10.XX
https://www.google-analytics.com/collect? … dl=https%3A%2F%2Fanalytics.twitter.com%2Fuser%2Fsolso%2Fhome& ... &vp=1140x645&...

Page 15: Data Collection without Privacy Side Effects

Example: Counting Unique Visitors

Records received by the GA backend:

wired.com/xyz 09:48:40 82.143.2.X
wired.com/xyz 09:48:42 137.9.10.X
wired.com/xyz 09:48:59 137.9.10.X
wired.com/xyz 09:49:12 137.9.10.X

4 people visited wired.com/xyz? 1 person visited wired.com/xyz 4 times? How can it be resolved?

Records received by the GA backend, with a UID:

wired.com/xyz 09:48:40 [82.143.2.X, 1320x910]
wired.com/xyz 09:48:42 [137.9.10.X, 1266x809]
wired.com/xyz 09:48:59 [137.9.10.X, 940x645]
wired.com/xyz 09:49:12 [137.9.10.X, 940x645]

Identifying which records come from the same person avoids over-counting, so a UID is needed: 4 visits, 3 unique visitors.

Page 16: Data Collection without Privacy Side Effects

Example: Counting Unique Visitors

Records received by the GA backend:

wired.com/xyz 09:48:40 ---
wired.com/xyz 09:48:42 ---
wired.com/xyz 09:48:59 ---
wired.com/xyz 09:49:12 ---

4 people visited wired.com/xyz? 1 person visited wired.com/xyz 4 times? How can it be resolved?

Records received by the GA backend, with a UID:

wired.com/xyz 09:48:40 [82.143.2.X, 1320x910]
wired.com/xyz 09:48:42 [137.9.10.X, 1266x809]
wired.com/xyz 09:48:59 [137.9.10.X, 940x645]
wired.com/xyz 09:49:12 [137.9.10.X, 940x645]

Identifying which records come from the same person avoids over-counting, so a UID is needed: 4 visits, 3 unique visitors.

The same kind of UID also links records across sites:

wired.com/ 09:49:12 [137.9.10.X, 1140x645]
ebay-kleinanzeigen.de/s-muenchen/cyclocross/k0l6411r200 09:50:02 [137.9.10.X, 1140x645]
twitter.com/solso 09:52:10 [137.9.10.X, 1140x645]
www.meetic.com/home/index.php 09:59:01 [137.9.10.X, 1140x645]
analytics.twitter.com/user/solso/home 10:05:45 [137.9.10.X, 1140x645]
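
The two record lists on this slide can be tied together with a small sketch. It is only an illustration, not GA's actual backend logic: it assumes the UID is simply the bracketed (IP, screen size) pair shown on the slide, and the function names `uid`, `count_uniques` and `sessions` are hypothetical. It shows that the key that makes unique counting possible is the same key that reassembles a browsing trail across sites.

```python
from collections import defaultdict

# Records as shown on the slides: (url, time, ip, screen size).
records = [
    ("wired.com/xyz", "09:48:40", "82.143.2.X", "1320x910"),
    ("wired.com/xyz", "09:48:42", "137.9.10.X", "1266x809"),
    ("wired.com/xyz", "09:48:59", "137.9.10.X", "940x645"),
    ("wired.com/xyz", "09:49:12", "137.9.10.X", "940x645"),
]

cross_site = [
    ("wired.com/", "09:49:12", "137.9.10.X", "1140x645"),
    ("ebay-kleinanzeigen.de/s-muenchen/cyclocross/k0l6411r200", "09:50:02", "137.9.10.X", "1140x645"),
    ("twitter.com/solso", "09:52:10", "137.9.10.X", "1140x645"),
    ("www.meetic.com/home/index.php", "09:59:01", "137.9.10.X", "1140x645"),
    ("analytics.twitter.com/user/solso/home", "10:05:45", "137.9.10.X", "1140x645"),
]


def uid(ip, screen):
    # Assumption: the UID is the (ip, screen) pair shown in brackets on the slide.
    return (ip, screen)


def count_uniques(recs, url):
    """Server-side aggregation: return (visits, unique visitors) for one URL."""
    hits = [r for r in recs if r[0] == url]
    return len(hits), len({uid(r[2], r[3]) for r in hits})


def sessions(recs):
    """The side effect: grouping by the same UID yields a per-user trail."""
    by_uid = defaultdict(list)
    for url, time, ip, screen in recs:
        by_uid[uid(ip, screen)].append((time, url))
    return {key: sorted(trail) for key, trail in by_uid.items()}


print(count_uniques(records, "wired.com/xyz"))
for time, url in sessions(cross_site)[("137.9.10.X", "1140x645")]:
    print(time, url)
```

Nothing in `sessions` is extra work for the collector: once a UID reaches the server for counting, the cross-site trail, including the page that carries the username, can be rebuilt from it.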

Page 17: Data Collection without Privacy Side Effects

As long as per-user aggregation of data is needed on the server-side, we will always incur undesired privacy side-effects.

Page 18: Data Collection without Privacy Side Effects

Since server-side aggregation is the root of the problem, we should move the aggregation of data to the client-side (i.e. the user's browser).

Page 19: Data Collection without Privacy Side Effects

Counting Unique Visitors…

[Diagram, built up over six slides: two browsers each load wired.com/xyz, which includes a 3rd party tracking script. One side shows server-side aggregation, the other client-side aggregation.]

Server-side Aggregation – Google Analytics

•  The Browser sends to the GA Backend: wired.com/xyz [137.9.10.X, 940x645]
•  The GA Backend counts uniques.

Client-side Aggregation – CLIQZ Green Tracker

•  The Browser starts with state = []
•  The Browser sends to the CGT Backend: visit wired.com/xyz and unique-visit wired.com/xyz
•  The Browser now holds state = [ H(wired.com/xyz, unique-visit, timestamp) ]
•  The CGT Backend counts uniques.

Page 25: Data Collection without Privacy Side Effects

Counting Unique Visitors…

[Diagram, built up over six slides: the same two browsers load wired.com/xyz again. The messages from the previous slides remain visible at both backends.]

Server-side Aggregation – Google Analytics

•  The Browser sends to the GA Backend once more: wired.com/xyz [137.9.10.X, 940x645]
•  The GA Backend counts uniques.

Client-side Aggregation – CLIQZ Green Tracker

•  The Browser still holds state = [ H(wired.com/xyz, unique-visit, timestamp) ]
•  The Browser sends to the CGT Backend only: visit wired.com/xyz
•  The CGT Backend counts uniques.

Keeping this state is possible if you control the browser (i.e. CLIQZ). But it is also possible with the HTML5 LocalStorage and PostMessage APIs.
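
The client-side flow on these slides can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the green-tracker code: `Client` and `Backend` are hypothetical stand-ins for the browser-side script and the CGT Backend, H is assumed to be SHA-256, and the "timestamp" inside H is assumed to be a day bucket (the slides do not say which granularity is used).

```python
import hashlib


def H(url, kind, bucket):
    # Assumption: SHA-256 over the joined fields; the slides only name "H".
    return hashlib.sha256(f"{url}|{kind}|{bucket}".encode()).hexdigest()


class Client:
    """Stand-in for the tracking script and the state it keeps in the browser."""

    def __init__(self):
        self.state = set()  # corresponds to "state = []" on the slides

    def on_page_load(self, url, day):
        # Every load is reported as a plain visit, with no identifier attached.
        messages = [("visit", url)]
        key = H(url, "unique-visit", day)
        if key not in self.state:
            # First load in this time bucket: remember it locally, report it once.
            self.state.add(key)
            messages.append(("unique-visit", url))
        return messages


class Backend:
    """Stand-in for the collector: it only increments counters."""

    def __init__(self):
        self.counts = {}

    def receive(self, messages):
        for kind, url in messages:
            self.counts[(kind, url)] = self.counts.get((kind, url), 0) + 1


backend = Backend()
a, b = Client(), Client()
day = "2016-03-01"  # hypothetical time bucket
backend.receive(a.on_page_load("wired.com/xyz", day))  # visit + unique-visit
backend.receive(a.on_page_load("wired.com/xyz", day))  # visit only
backend.receive(b.on_page_load("wired.com/xyz", day))  # visit + unique-visit
print(backend.counts)
```

In this sketch the de-duplication decision is made where the state lives, in the browser, so the messages need no UID field and the backend has nothing in them to link two messages to the same user.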

Page 31: Data Collection without Privacy Side Effects

Beyond Counting Unique Visitors?

Working prototype of a GA-clone featuring:

–  Unique visits and page loads.
–  Returning customers.
–  Goal conversion to track campaigns.
–  Cross-site correlations.
–  In-site click-throughs.
–  Visits and time on page per user (without beacons).

A privacy-preserving tracking agent, green-tracker, implements all 6 of these use-cases in less than 200 lines of code.

Demo: http://site1.test.cliqz.com/

Page 32: Data Collection without Privacy Side Effects

Conclusions

Data collection based on server-side aggregation of users' data is very problematic, as it implies tracking users.

Tracking leads to privacy side-effects; we provided evidence of privacy leaks on Google Analytics.

Tracking can be avoided if one switches the design pattern to client-side aggregation.

To demonstrate the feasibility of client-side aggregation, we built and open-sourced a Google Analytics look-alike, https://github.com/cliqz/green-tracker, which implements in a privacy-preserving way a wide range of use-cases that would otherwise require tracking users.

Page 33: Data Collection without Privacy Side Effects

Q&A

Thanks for your attention!

Page 34: Data Collection without Privacy Side Effects

Appendix

Page 35: Data Collection without Privacy Side Effects

Keeping State on the Client

Modern browsers have the ability to keep state via HTML5 LocalStorage. Therefore, a privacy-preserving tracking script can keep a persistent state across multiple sites if it is loaded from an IFRAME.

•  Looks pretty familiar, but is slightly different:

–  LocalStorage belongs to green-tracker.fbt.co (the collector backend)
–  Respects CORS
–  IFRAME is sandboxed (no access to Document)
–  Explicit control from site-owner (postMessage)
–  Explicit control from user (messages and state can be inspected and removed at will)

Page 36: Data Collection without Privacy Side Effects

Limitations

As always, there are limitations that one must consider:

•  Deployment is not immediate. It requires code changes in both the tracking script and the collectors.
•  Unplanned use-cases might not be possible retrospectively.
•  The business logic of the data collector is explicit to the user.
•  The state of the client can become a privacy issue if not handled properly; take care not to create a duplicated history.
•  Browsers might have factory-default options that prevent LocalStorage from working as expected. For instance, Safari blocks 3rd party cookies, which also affects LocalStorage; the user can change the setting, but this is sub-optimal.