Data Collection without Privacy Side Effects
CLIQZ @ BIG 2016…
Data Collection without Privacy Side-Effects
Konark Modi (@konarkmodi), Josep M. Pujol (@solso)
Data collection on Big Data
Where does the data of Big Data come from? The elephant in the room.
Applications of Big Data
Who collects data on the Web?
Wired, Ebay and Meetic collect data as 1st parties when a user visits or interacts with their sites. However, there are a lot of 3rd parties that also collect data. According to CLIQZ's paper "Tracking the Trackers", to be presented at WWW 2016 [1], 78% of page loads send information to at least one 3rd party that is deemed unsafe with respect to privacy.
Motivation: A recurring real-life conversation
Company X: Hi, this is company X. CLIQZ anti-tracking is affecting us. Can we talk?
CLIQZ: Sure.
Company X: We are not trackers. We only measure audiences (or collect aggregated data, or measure goal conversion or site performance metrics). We take privacy very seriously.
CLIQZ: Understood, let us check what's going on.
Company X: Thanks a lot.
CLIQZ: Well, you are actually tracking users. See the attachment. You have the ability to know that these 20 webpages were visited by the same person and, to make things worse, you can derive his real identity. Users' privacy is at risk.
Company X: No, no. We do NOT use that information at all, we remove it as soon as it is received. We are only interested in measuring XYZ.
CLIQZ: But we just showed you an example of tracking. Intentional or not should not matter, right?
Company X: I repeat that we are NOT using this data at all for anything, see our Privacy Policy. To implement our service we require that data element that can be used as a user identifier, there is no other way…
CLIQZ: There is another way. Happy to show you …
Motivation: A recurring real-life conversation
Company X: ...mmm, thanks… ...er... ...we will get back to you...
CLIQZ: Great! Looking forward to it.
Unfortunately, they never come back. We formulated 3 hypotheses:
1) They are interested in collecting data from users. They are "intentionally" tracking.
2) They are not concerned about privacy side-effects. On the trade-off between privacy and convenience, they choose the latter.
3) We could not successfully explain our alternative approach for privacy-preserving data collection.
We hope that it is not #1. That is why we decided:
• To open-source a prototype of a Google Analytics look-alike that does not rely on tracking, hoping that the code will be more explanatory.
• To write this paper and presentation.
An Example of Unintentional Tracking
Google Analytics (GA)
• GA is massive, present in 44% of all page loads.
• GA does not offer any (public) service that requires building a session with all of a user's activity.
• GA actually cares a lot about privacy:
– Ephemeral UIDs
– Sanitization of URLs
Privacy Breaches are Unavoidable (even for GA)
wired.com/ 09:49:12 [137.9.10.X, 1140x645]
ebay-kleinanzeigen.de/s-muenchen/cyclocross/k0l6411r200 09:50:02 [137.9.10.X, 1140x645]
twitter.com/solso 09:52:10 [137.9.10.X, 1140x645]
www.meetic.com/home/index.php 09:59:01 [137.9.10.X, 1140x645]
analytics.twitter.com/user/solso/home 10:05:45 [137.9.10.X, 1140x645]
The last page is only accessible after login and it contains my username => Personally Identifiable Information (PII) leak.
IP: 137.9.10.XX
https://www.google-analytics.com/collect? … dl=https%3A%2F%2Fanalytics.twitter.com%2Fuser%2Fsolso%2Fhome& ... &vp=1140x645&...
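To make the leak concrete, here is a minimal Python sketch that decodes a collect request like the one above. It is only an illustration: the request string contains just the two parameters visible on the slide (`dl` and `vp`), and the elided parameters are left out rather than guessed.

```python
from urllib.parse import parse_qs, urlsplit

# Only the parameters shown on the slide; the elided ones are omitted on purpose.
request = (
    "https://www.google-analytics.com/collect?"
    "dl=https%3A%2F%2Fanalytics.twitter.com%2Fuser%2Fsolso%2Fhome"
    "&vp=1140x645"
)

params = parse_qs(urlsplit(request).query)
page_url = params["dl"][0]  # page the user was on (percent-decoded by parse_qs)
viewport = params["vp"][0]  # viewport size of the browser window

# The path of a login-only page can carry the account name.
path_segments = urlsplit(page_url).path.split("/")

print(page_url)       # https://analytics.twitter.com/user/solso/home
print(viewport)       # 1140x645
print(path_segments)  # ['', 'user', 'solso', 'home']
```

Anyone who receives this request, together with the sender's IP address, can read the username straight from the page URL.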
Example: Counting Unique Visitors
GA backend:
wired.com/xyz 09:48:40 82.143.2.X
wired.com/xyz 09:48:42 137.9.10.X
wired.com/xyz 09:48:59 137.9.10.X
wired.com/xyz 09:49:12 137.9.10.X
4 people visited wired.com/xyz? 1 person visited wired.com/xyz 4 times? How can it be resolved?
GA backend:
wired.com/xyz 09:48:40 [82.143.2.X, 1320x910]
wired.com/xyz 09:48:42 [137.9.10.X, 1266x809]
wired.com/xyz 09:48:59 [137.9.10.X, 940x645]
wired.com/xyz 09:49:12 [137.9.10.X, 940x645]
Identifying which records come from the same person is required to avoid over-counting. A UID is needed: 4 visits, 3 unique visitors.
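The counting above can be reproduced with a short Python sketch. It assumes, purely for illustration, that the backend uses the pair (IP, viewport) as the UID; the slides do not say how GA actually builds its identifier.

```python
# The four records from the slide: (url, time, ip, viewport).
records = [
    ("wired.com/xyz", "09:48:40", "82.143.2.X", "1320x910"),
    ("wired.com/xyz", "09:48:42", "137.9.10.X", "1266x809"),
    ("wired.com/xyz", "09:48:59", "137.9.10.X", "940x645"),
    ("wired.com/xyz", "09:49:12", "137.9.10.X", "940x645"),
]

visits = len(records)

# The IP alone is too coarse: two different browsers share 137.9.10.X.
uniques_by_ip = len({ip for _, _, ip, _ in records})

# Assumed UID for this sketch: the (ip, viewport) pair.
uniques_by_uid = len({(ip, viewport) for _, _, ip, viewport in records})

print(visits, uniques_by_ip, uniques_by_uid)  # 4 2 3
```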
wired.com/ 09:49:12 [137.9.10.X, 1140x645]
ebay-kleinanzeigen.de/s-muenchen/cyclocross/k0l6411r200 09:50:02 [137.9.10.X, 1140x645]
twitter.com/solso 09:52:10 [137.9.10.X, 1140x645]
www.meetic.com/home/index.php 09:59:01 [137.9.10.X, 1140x645]
analytics.twitter.com/user/solso/home 10:05:45 [137.9.10.X, 1140x645]
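A minimal Python sketch of why a UID kept for de-duplication also links a browsing session across sites. As an assumption for illustration, the (IP, viewport) pair stands in for the UID; grouping the five records above by that key puts them all in one session.

```python
from collections import defaultdict

# The five records from the slide: (url, time, ip, viewport).
records = [
    ("wired.com/", "09:49:12", "137.9.10.X", "1140x645"),
    ("ebay-kleinanzeigen.de/s-muenchen/cyclocross/k0l6411r200", "09:50:02", "137.9.10.X", "1140x645"),
    ("twitter.com/solso", "09:52:10", "137.9.10.X", "1140x645"),
    ("www.meetic.com/home/index.php", "09:59:01", "137.9.10.X", "1140x645"),
    ("analytics.twitter.com/user/solso/home", "10:05:45", "137.9.10.X", "1140x645"),
]

# Group page loads by the assumed UID.
sessions = defaultdict(list)
for url, time, ip, viewport in records:
    sessions[(ip, viewport)].append((time, url))

for uid, pages in sessions.items():
    print(uid, len(pages))  # ('137.9.10.X', '1140x645') 5
```

Once a single page in the group reveals who the user is, every other page in the same group is attributed to that person.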
As long as aggregation of data per user is needed on the server side, we will always incur undesired privacy side-effects.
Since server-side aggregation is the root of the problem, we should move the aggregation of data to the client-side (i.e. the user’s browser)
Counting Unique Visitors…
Server-side Aggregation – Google Analytics
• The browser loads wired.com/xyz, which includes a 3rd party tracking script.
• First visit: the script sends "wired.com/xyz [137.9.10.X, 940x645]" to the GA Backend.
• Second visit: the script sends "wired.com/xyz [137.9.10.X, 940x645]" again.
• The GA Backend counts uniques.
Client-side Aggregation – CLIQZ Green Tracker (CGT)
• The browser loads wired.com/xyz, which includes a 3rd party tracking script. Initially, state = [].
• First visit: the script sends "visit wired.com/xyz" and "unique-visit wired.com/xyz" to the CGT Backend, and the client keeps state = [H(wired.com/xyz, unique-visit, timestamp)].
• Second visit: the state already contains that entry, so the script sends only "visit wired.com/xyz".
• The CGT Backend counts uniques.
Possible if you control the browser (i.e. CLIQZ). But also possible with HTML5 LocalStorage and PostMessage APIs.
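The client-side flow can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the green-tracker code: the `Client` and `Backend` classes, the use of SHA-256 for H, and the choice of a calendar day as the timestamp are all hypothetical.

```python
import hashlib
from datetime import date


def h(url, event, day):
    """Assumed H: hash the entry so the raw URL is not kept in the client state."""
    return hashlib.sha256(f"{url}|{event}|{day.isoformat()}".encode()).hexdigest()


class Client:
    """Hypothetical browser-side agent; `state` never leaves the client."""

    def __init__(self):
        self.state = set()

    def on_page_load(self, url, day):
        """Return the messages to send; none of them carries a user identifier."""
        messages = [("visit", url)]
        key = h(url, "unique-visit", day)
        if key not in self.state:  # first time this client sees the page today
            self.state.add(key)
            messages.append(("unique-visit", url))
        return messages


class Backend:
    """Hypothetical collector: it only increments a counter per (event, url)."""

    def __init__(self):
        self.counts = {}

    def receive(self, messages):
        for message in messages:
            self.counts[message] = self.counts.get(message, 0) + 1


backend = Backend()
alice, bob = Client(), Client()
today = date(2016, 1, 1)

# Alice loads the page twice, Bob once.
for client in (alice, alice, bob):
    backend.receive(client.on_page_load("wired.com/xyz", today))

visits = backend.counts[("visit", "wired.com/xyz")]
uniques = backend.counts[("unique-visit", "wired.com/xyz")]
print(visits, uniques)  # 3 2
```

The backend ends up with the same two numbers a server-side design would compute, but it never receives anything that would let it tell which messages came from the same browser.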
Beyond Counting Unique Visitors?
Working prototype of a GA-clone featuring:
– Unique visits and page loads.
– Returning customers.
– Goal conversion to track campaigns.
– Cross-site correlations.
– In-site click-throughs.
– Visits and time in page per user (without beacons).
A privacy-preserving tracking agent: green-tracker, which implements all these 6 use-cases in less than 200 lines of code. Demo: http://site1.test.cliqz.com/
Conclusions
Data collection based on server-side aggregation of users' data is very problematic, as it implies tracking users.
Tracking leads to privacy side-effects; we provided evidence of privacy leaks on Google Analytics.
Tracking can be avoided if one switches the design pattern to client-side aggregation.
To demonstrate the feasibility of client-side aggregation, we built and open-sourced a Google Analytics look-alike, https://github.com/cliqz/green-tracker, which implements in a privacy-preserving way a wide range of use-cases that typically require tracking users.
Q&A
Thanks for your attention!
Appendix
Keeping State on the Client
Modern browsers have the ability to keep state via HTML5 LocalStorage. Therefore, a privacy-preserving tracking script can keep a persistent state across multiple sites if loaded from an IFRAME.
• Looks pretty familiar, but is slightly different:
– LocalStorage belongs to green-tracker.fbt.co (the collector backend)
– Respects CORS
– IFRAME is sandboxed (no access to Document)
– Explicit control from site-owner (postMessage)
– Explicit control from user (messages and state can be removed and inspected at will)
Limitations
As always, there are limitations that one must consider:
• Deployment is not immediate. It requires code changes both in the tracking script and in the collectors.
• Unplanned use-cases might not be possible retrospectively.
• The business logic of the data collector is explicit to the user.
• The state of the client can become a privacy issue if not handled properly; take care not to create a duplicate of the browsing history.
• Browsers might have factory-default options that prevent LocalStorage from working as expected. For instance, Safari blocks 3rd party cookies, which affects LocalStorage; the user can change the setting, but this is sub-optimal.