Analysis, modelling and protection of online private data.


Advisors:

Dr. Jordi FORNÉ MUÑOZ, Dr. David REBOLLO MONEDERO

In partial fulfilment of the requirements for the degree of: Doctor of philosophy.

Silvia Puglisi, [email protected]

Analysis, modelling and protection of online private data.

Agenda

Background. Introduction and scope of the investigation.

Objectives. Objectives of the investigation.

Ongoing and future work. Publications and current research efforts.

Background

Online privacy

Is privacy the right to be forgotten?

In 2011, the amount of digital information created and replicated globally exceeded 1.8 zettabytes (1.8 trillion gigabytes).

75% of this information is created by individuals through new media fora such as blogs and via social networks.

By the end of 2011, Facebook had 845 million monthly active users, sharing over 30 billion pieces of content.

Library Briefing - Library of the European Parliament - 01/03/2012

What is online privacy anyway?

In an online context, the right to privacy has commonly been interpreted as a right to “information self-determination”.

Acts typically claimed to breach online privacy concern the collection of personal information without consent, the selling of personal information and the further processing of that information.


Do we have online privacy?

Irani, Danesh et al. [1] describe how personal information leaks on social networks can be used for concrete attacks.

Acquisti, Alessandro, and Ralph Gross [2] also presented a method to infer people's Social Security numbers using only publicly available information.

Goga, Oana et al. [3] describe how a user's activity on one site can implicitly reveal their identity on another site.

Chen, Terence et al. [4] analysed the amount and type of information revealed across users' social network profiles.

The age of the “metadata”

Metadata is collected and stored by public and private organisations, recording where, when and by whom a particular piece of online content was created and accessed.

In the private sphere it has been said that “literally, Google knows more about us than we can remember ourselves”.

This situation has led to growing concerns regarding online privacy.

In China, one estimate suggests there are over 30 000 government censors monitoring online information.

Library Briefing - Library of the European Parliament - 01/03/2012

Ex: Google Conversion Tracking

<html><body>
<!-- Sample text link with a phone number. Replace the number with your own phone number and the CALL NOW text with the text you want to hyperlink. goog_report_conversion is defined by Google's conversion tracking script, which must be loaded earlier in the page. -->
<a onclick="goog_report_conversion('tel:949-555-1234')" href="#">CALL NOW</a>
</body></html>

Some websites implement a Google forwarding number that measures the calls made by potential customers.

Why metadata matters

Metadata is often more revealing than the content itself. For example:

● They know you called the suicide prevention hotline, but the actual conversation remains secret.

● They know you checked HIV-related websites, talked to an HIV testing service, then spoke to your doctor, but they don't know what was discussed during the calls.

Furthermore, Bizer, Christian et al. [5] have shown how websites already embed structured data in their HTML pages to describe products, services and events, making user information available through markup standards such as Microformats, Microdata and RDFa.
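To make this concrete, the following rough sketch (assuming the requests and beautifulsoup4 packages are available; the URL is a placeholder, not one used in this work) lists the Microdata item types and properties embedded in a page.

# Minimal sketch: listing Microdata embedded in an HTML page.
# Assumes requests and beautifulsoup4 are installed; the URL is a placeholder.
import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.com/product-page").text
soup = BeautifulSoup(html, "html.parser")

# Every element carrying itemscope starts a Microdata item.
for item in soup.find_all(attrs={"itemscope": True}):
    item_type = item.get("itemtype", "(no itemtype)")
    # Its descendants with itemprop carry the item's properties.
    props = {p.get("itemprop"): p.get_text(strip=True)
             for p in item.find_all(attrs={"itemprop": True})}
    print(item_type, props)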

Hyperdata && Hypermedia

Hyperdata indicates data objects linked to other data objects in other places, as hypertext indicates text linked to other text in other places.

Hyperdata enables the formation of a web of data, evolving from the "data on the Web" that is not inter-related (or at least, not linked).

Hypermedia, an extension of the term hypertext, is a nonlinear medium of information which includes graphics, audio, video, plain text and hyperlinks.

Source: Wikipedia

What is REST?

REST is an architectural style introduced by Roy Thomas Fielding in 2000 that has been at the core of web design and development.

REST represents an abstraction over the actual architecture of the web.

In REST, identification, representation and format are independent concepts. Specifically:

A URI can identify a resource without specifying what formats the resource uses to exchange representations. Likewise, the protocols and representations used by the resource to communicate can be modified independently of the URI identifying it.
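As a hedged illustration of this separation, the sketch below (using Python's requests library and a placeholder URI) asks the same URI for different representations purely through content negotiation.

# Sketch: one URI, several representations, negotiated via the Accept header.
# The URI is a placeholder; any REST-style endpoint behaves similarly.
import requests

uri = "https://api.example.com/users/42"

for media_type in ("application/json", "application/xml", "text/html"):
    response = requests.get(uri, headers={"Accept": media_type})
    # The identifier never changes; only the representation
    # (and its Content-Type) may differ across responses.
    print(media_type, "->", response.headers.get("Content-Type"))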

RESTful Architectures

REST interfaces

The uniformity of REST interfaces is built upon four guiding principles:

● The identification of resources through the URI mechanism.

● The manipulation of resources through their representations.

● The use of self-descriptive messages.

● The implementation of hypermedia as the engine of application state (HATEOAS); a sketch follows below.
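To make the last principle concrete, the sketch below shows what a hypermedia-driven (HATEOAS-style) resource representation might look like; the JSON shape, link-relation names and URIs are an illustrative convention assumed for this example, not something mandated by REST.

# Sketch of a hypermedia (HATEOAS-style) resource representation: the client
# discovers which state transitions are available from the embedded links,
# rather than from out-of-band knowledge of the API.
# Link names and URIs are illustrative only.
import json

order = {
    "id": 42,
    "status": "open",
    "_links": {
        "self": {"href": "/orders/42"},
        "payment": {"href": "/orders/42/payment"},  # available transition
        "cancel": {"href": "/orders/42/cancel"},    # available transition
    },
}

print(json.dumps(order, indent=2))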

Hypermedia and privacy protection

Information self-determination is not even possible if users have no control over their online footprint.

Hypermedia provides context over unstructured footprint information.

Users and applications use REST interfaces to interact with one another exchanging resource representations.

The web follows REST principles and so do users’ online traces.

Hypermedia and privacy protection

Genc, Yegin, et al. [6] introduce a method to map text messages into a wider context and, by computing the distance between them, classify their content.

Ducheneaut, Nicolas et al. [7] explain how recommender systems need to incorporate contextual information from the physical world, as users move continuously and frequently engage in a variety of activities.

Sakaki, Takeshi et al. [8] discuss how real-time interaction between online users and the offline world can be used to detect target events, turning the actual users into sensors themselves.

Objectives

Objective 1

Development of a hypermedia model of the user's online footprint

Objective 1

This hypermedia model of the user's online footprint is constructed by analysing the different interactions that the user has online with various services and platforms.

Hyperme is the proposed hyperdata model of a user's online footprint.

The hyperme model links the user footprints created across different services and the features associated with them in a hypergraph.

The user footprint is therefore transformed into an object that can be explored based on desired features.
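As a purely illustrative sketch of the idea (the identifiers, features and data below are hypothetical, not the actual hyperme implementation), footprints from different services can be stored as nodes while hyperedges group the footprints sharing a feature, so the footprint can be explored feature by feature.

# Hypothetical sketch of a footprint hypergraph: nodes are individual
# footprints (one per service interaction), hyperedges group all the
# footprints sharing a feature (a topic, a location, a device, ...).
# Names and sample data are illustrative, not the actual hyperme model.
from collections import defaultdict

footprints = [
    {"id": "tw:1", "service": "twitter",  "features": {"topic:privacy", "device:mobile"}},
    {"id": "fb:7", "service": "facebook", "features": {"topic:privacy", "location:bcn"}},
    {"id": "ws:3", "service": "browsing", "features": {"topic:security", "device:mobile"}},
]

# Build one hyperedge per feature, linking every footprint that carries it.
hyperedges = defaultdict(set)
for fp in footprints:
    for feature in fp["features"]:
        hyperedges[feature].add(fp["id"])

# The footprint can now be explored by feature:
print(hyperedges["topic:privacy"])   # {'tw:1', 'fb:7'}
print(hyperedges["device:mobile"])   # {'tw:1', 'ws:3'}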

Objective 1

Users stream private information towards devices, applications and platforms.

This information is shared with groups of different people with distinct access rights.

Private (?) information is only shared with service providers.

Objective 1

The hyperme model captures different aspects of user activities online:

● Everything in the hyperme model is a signal.

● Signals can be easily profiled (see the profiling sketch below).

● Signals can be linked to each other.

● Footprints become objects that can be explored.
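As a minimal, hypothetical sketch of the profiling step mentioned above (categories and counts are made up for illustration), a stream of signals can be summarised as a normalised histogram over categories.

# Hypothetical sketch: profiling a stream of signals as a normalised
# histogram over categories. Categories and counts are made up.
from collections import Counter

signals = ["news", "sports", "news", "health", "news", "sports"]

counts = Counter(signals)
total = sum(counts.values())
profile = {category: n / total for category, n in counts.items()}

print(profile)   # e.g. {'news': 0.5, 'sports': 0.33..., 'health': 0.16...}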

Objective 1

The last two weeks of activity of Stephen Fry's Twitter account (@stephenfry) have been analysed.


Objective 2

Analysis of data flows from social networks to third party advertisers

Objective 2

The aim of this objective is to understand what data is leaked to third party advertising networks and how these networks and social platforms track users as they surf the web.

The exchange of identity information is followed from the client to third party advertising platforms.

Methods implemented by third party advertising networks are discovered and classified by analysing network requests (HTTP) and actual data flow (JavaScript calls).

The mathematical distance between the user's profile and the observed advertising profile is taken as a measure of how accurately third party platforms are tracking the user.
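One simple way to realise such a measurement, sketched below with made-up numbers, is to represent both profiles as histograms over the same interest categories and compute a distance between them; the choice of cosine distance here is illustrative, and other measures such as the KL divergence could equally be used.

# Sketch: distance between the user's actual interest profile and the
# profile apparently inferred by the advertising network. Both are
# histograms over the same interest categories; the numbers are made up.
import math

user_profile = {"sports": 0.5, "news": 0.3, "health": 0.2}
ad_profile   = {"sports": 0.4, "news": 0.4, "health": 0.2}

def cosine_distance(p, q):
    categories = set(p) | set(q)
    dot = sum(p.get(c, 0.0) * q.get(c, 0.0) for c in categories)
    norm = math.sqrt(sum(v * v for v in p.values())) * \
           math.sqrt(sum(v * v for v in q.values()))
    return 1.0 - dot / norm

# The smaller the distance, the more accurately the network tracks the user.
print(cosine_distance(user_profile, ad_profile))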


Objective 3

Evaluation of different PETs in Content Recommendation Systems

Objective 3

The goal of this objective is the evaluation of different PETs in Content Recommendation Systems.

Our aim is to show how a recommendation system is affected by the application of certain PETs by a part of the user population.

Users may, in fact, wish to protect their privacy while also maintaining a satisfactory level of utility of the information received from the recommendation platform.

Different levels of privacy protection are evaluated.
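As a hedged illustration of the kind of PET considered, suppression of part of the user profile is one mechanism discussed in the literature; the rule and data below are a toy sketch for illustration, not the mechanism evaluated in this work.

# Illustrative sketch of a suppression-style PET: a user withholds a
# fraction of the least frequent tags in their profile before sending it
# to the recommender. The suppression rule and data are illustrative only.
def suppress(profile, sigma):
    """Drop the sigma fraction of tags with the lowest weight."""
    ranked = sorted(profile.items(), key=lambda kv: kv[1])
    n_drop = int(sigma * len(ranked))
    return dict(ranked[n_drop:])

tag_profile = {"privacy": 12, "football": 9, "cooking": 3, "health": 2, "jazz": 1}

for sigma in (0.0, 0.2, 0.4):
    print(sigma, suppress(tag_profile, sigma))
# Higher sigma means fewer (potentially sensitive) tags disclosed, but
# also less information for the recommender, i.e. lower utility.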

Objective 4

Evaluation of different PETs to prevent information leaks on third party advertising networks

Objective 4

The goal of this objective is the evaluation of different PETs to prevent third party advertising networks from pervasively tracking users through their browsing patterns and social platform profiles.

In particular we are concerned with understanding how third party advertising networks can be prevented from accessing certain private data about the user.
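A minimal, hypothetical sketch of one family of countermeasures, request filtering as performed by common anti-tracking extensions, is given below; the blocklist and URLs are toy examples, not a real list used in this work.

# Toy sketch of request filtering against third-party trackers: outgoing
# requests whose domain is both third-party and on a blocklist can be
# dropped (or stripped of cookies). Blocklist and URLs are illustrative.
from urllib.parse import urlparse

BLOCKLIST = {"tracker.example", "ads.example"}

def allow(request_url, first_party_url):
    req_host = urlparse(request_url).hostname or ""
    fp_host = urlparse(first_party_url).hostname or ""
    third_party = not req_host.endswith(fp_host)
    blocked = any(req_host.endswith(d) for d in BLOCKLIST)
    return not (third_party and blocked)

print(allow("https://ads.example/pixel.gif", "https://news.site/article"))  # False
print(allow("https://news.site/style.css", "https://news.site/article"))    # True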

Objective 5

Extension of the hyperme model to cover aspects of location identity

Objective 5

This objective aims at:

● Analysing the amount and extent of geographically tagged information shared through online activities.

● Establishing links between location information and spatial context.

● Evaluating different PETs to protect users' location privacy.
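As a hedged illustration, two simple location-obfuscation PETs that could be evaluated in this setting are sketched below: snapping coordinates to a coarse grid and perturbing them with random noise. Parameters and coordinates are made up and do not correspond to a mechanism adopted in the thesis.

# Illustrative sketch of two simple location-obfuscation PETs:
# (a) snapping coordinates to a coarse grid, and (b) adding random noise.
# Parameters and coordinates are made up for illustration.
import random

def snap_to_grid(lat, lon, cell=0.01):   # roughly 1 km cells near the equator
    return round(lat / cell) * cell, round(lon / cell) * cell

def add_noise(lat, lon, sigma=0.005):    # Gaussian perturbation
    return lat + random.gauss(0, sigma), lon + random.gauss(0, sigma)

exact = (41.38879, 2.15899)              # a sample Barcelona location
print(snap_to_grid(*exact))
print(add_noise(*exact))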

Ongoing and future work

Ongoing and future work

At the moment we are applying the hyperme hypermedia model to profile user activities online.

We are especially concerned with answering the following questions:

● How is advertising influenced by online activities?

● To what extent does social network activity influence third party advertising?

● To what extent can mobile phone activity influence third party advertising?

● What PETs can be implemented to protect users’ privacy?

Ongoing and future work

We are collaborating with Dr. Markus Huber @ SBA Research (Vienna, Austria) on the following topics:

● Analysing the Alexa Top Million websites to compile statistics on the tracking services currently implemented (a rough sketch follows below).

● Testing current anti-tracking technologies to determine how effective they are.
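A rough sketch of how such a census could be bootstrapped is given below, assuming the requests and beautifulsoup4 packages; the site list and tracker list are tiny placeholders for the Alexa ranking and a curated tracker database.

# Rough sketch of a tracker census: fetch each site's front page and count
# which known tracking domains appear in its script tags. The site list and
# tracker list are tiny placeholders for a real ranking and tracker database.
from collections import Counter
from urllib.parse import urlparse
import requests
from bs4 import BeautifulSoup

SITES = ["https://example.com", "https://example.org"]
KNOWN_TRACKERS = {"doubleclick.net", "google-analytics.com", "facebook.net"}

tally = Counter()
for site in SITES:
    try:
        soup = BeautifulSoup(requests.get(site, timeout=10).text, "html.parser")
    except requests.RequestException:
        continue
    for script in soup.find_all("script", src=True):
        host = urlparse(script["src"]).hostname or ""
        for tracker in KNOWN_TRACKERS:
            if host.endswith(tracker):
                tally[tracker] += 1

print(tally.most_common())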

We are aiming at submitting a paper to the 36th IEEE Symposium on Security and Privacy.

Publications

The following article was submitted to the journal Computer Standards & Interfaces, on the topic of content-based recommendation systems and privacy enhancing techniques:

S. Puglisi, J. Parra-Arnau, D. Rebollo-Monedero and J. Forné, "On Content-Based Recommendation and Users Privacy in Social Tagging Systems," preprint submitted to Computer Standards & Interfaces, April 2014. Submitted for publication.

I grew up with the understanding that the world I lived in was one where people enjoyed a sort of freedom to communicate with each other in privacy, without it being monitored, without it being measured or analyzed or sort of judged by these shadowy figures or systems, any time they mention anything that travels across public lines.

- Edward Snowden

Thank you.

References[1] D. Irani, S. Webb, and C. Pu, “Modeling unintended personal-information leakage from multiple online social networks,” IEEE Internet Computing, 2011.

[2] A. Acquisti and R. Gross, "Predicting social security numbers from public data," in Proceedings of the National Academy of Sciences, 2009.

[3] O. Goga, H. Lei, S. H. K. Parthasarathi, G. Friedland, R. Sommer, and R. Teixeira, “Exploiting innocuous activity for correlating users across sites,” in Proceedings of the 22nd international conference on World Wide Web, 2013.

[4] T. Chen, M. A. Kaafar, A. Friedman, and R. Boreli, "Is more always merrier? A deep dive into online social footprints," in Proceedings of the 2012 ACM Workshop on Online Social Networks, 2012.

[5] C. Bizer, K. Eckert, R. Meusel, H. Mühleisen, M. Schuhmacher, and J. Völker, "Deployment of RDFa, Microdata, and Microformats on the Web – a quantitative analysis," in 12th International Semantic Web Conference, In-Use track, Sydney, Australia, 21–25 October 2013.

[6] Y. Genc, Y. Sakamoto, and J. Nickerson, "Discovering context: Classifying tweets through a semantic transform based on Wikipedia," in Foundations of Augmented Cognition. Directing the Future of Adaptive Systems, ser. Lecture Notes in Computer Science, D. Schmorrow and C. Fidopiastis, Eds. Springer Berlin Heidelberg, 2011, vol. 6780, pp. 484–492.

[7] N. Ducheneaut, K. Partridge, Q. Huang, B. Price, M. Roberts, E. Chi, V. Bellotti, and B. Begole, “Collaborative filtering is not enough? experiments with a mixed-model recommender for leisure activities,” in User Modeling, Adaptation, and Personalization, ser. Lecture Notes in Computer Science, G.-J. Houben, G. McCalla, F. Pianesi, and M. Zancanaro, Eds. Springer Berlin Heidelberg, 2009, vol. 5535, pp. 295–306.

[8] T. Sakaki, M. Okazaki, and Y. Matsuo, "Earthquake shakes Twitter users: real-time event detection by social sensors," in Proceedings of the 19th International Conference on World Wide Web. ACM, 2010, pp. 851–860.