Pattern Recognition and Applications Lab Privacy · Protecting privacy is increasingly difficult...

Pattern Recognition and Applications Lab University of Cagliari, Italy Department of Electrical and Electronic Engineering Privacy Giorgio Fumera [email protected]


Pattern Recognition and Applications Lab

University of Cagliari, Italy

Department of Electrical and Electronic Engineering

Privacy

Giorgio Fumera

[email protected]


http://pralab.diee.unica.it

Outline

• Introduction
  – privacy issues in the information society
  – privacy in data release: microdata

• Techniques for protecting the privacy of microdata
  – identity disclosure: k-anonymity
  – attribute disclosure: ℓ-diversity, t-closeness
  – differential privacy

• Application examples
  – privacy-preserving data mining
  – location data
  – social networks

• Privacy issues in cloud scenarios


Resources


Ch. 9 Privacy

• V. Ciriani et al., Theory of Privacy and Anonymity, in: Algorithms and Theory of Computation Handbook (2nd ed.), M. Atallah and M. Blanton (eds.), CRC Press, 2009
  http://spdp.di.unimi.it/papers/cdfs-theory_privacy_anonymity.pdf

• S. De Capitani di Vimercati et al., Data Privacy: Definitions and Techniques, Int. J. of Uncertainty, Fuzziness and Knowledge-Based Systems, 20(6): 793–818 (2012)
  http://spdp.di.unimi.it/papers/ijufks2012.pdf

• P. Samarati and S. De Capitani di Vimercati, Cloud Security: Issues and Concerns, in: Encyclopedia of Cloud Computing, S. Murugesan and I. Bojanova (eds.), Wiley, 2016
  http://spdp.di.unimi.it/papers/sd-cloud_security.pdf


Introduction


Privacy issues in the information society

Privacy: a multifaceted concept whose meaning is context-dependent.

In the ICT field, several aspects lead to privacy issues:
– huge amounts of personal data are collected, stored, and processed (including user-generated data)
– unclear data ownership
– users' lack of control over their own data
– restricted access to information and the high cost of processing it are no longer valid protection measures

The rapid evolution of the ICT landscape leads to ever-changing privacy issues and privacy protection needs.


Privacy issues in the information society

Main kinds of data that are collected, stored, analysed and shared in digital form

– personal information acquired during online activities in everyday life
  • Internet browsing
  • social networks
  • online transactions
  • ...

– data released by public and private organisations (e.g., census data, business data, medical data) for research or statistical purposes, or because of laws and regulations
  • aggregate statistical data
  • data about specific individuals or organizations

– data outsourced for storage and computation (cloud services)


Privacy issues in the information society

Examples of privacy protection needs:
– the identity of users should be protected
– sensitive information about users should be kept private
– users’ actions (e.g., Web browsing data) should not be traceable

Protecting privacy is increasingly difficult due to
– the availability of different information sources whose analysis and correlation (linking) can allow leakage of information not intended for disclosure
– the availability of sophisticated techniques (e.g., data mining) to automatically analyse and correlate huge sources of information


Privacy issues in the information society

In the current ICT landscape, users interact with remote information sources to use online services and to retrieve data.

Three main technological aspects of privacy can be identified in this context:

– privacy of the user
– privacy of the communication
– privacy of the information


Privacy of the user

Protecting the identities of the parties that communicate through a network, to avoid tracing
– who is communicating with whom
– who is interacting with which server or searching for which data

Main solution: techniques and protocols to guarantee anonymous communication (e.g., Onion Routing, Tor)
– sender anonymity
– recipient anonymity


Privacy of the communication

Two main aspects related to confidentiality of the information:
– protecting the content of personal information sent through a network
  • main kind of technique: encryption protocols (e.g., SSL)
– protecting the content of service requests against misuse by providers (e.g., against user profiling)
  • private information retrieval
  • secure multi-party computation
  • privacy-preserving statistical analysis
  • privacy-preserving data mining


Privacy of the information

Privacy of the information refers to data about individuals and organizations that are collected, stored and possibly publicly released by public and private organizations:
– definition of privacy policies (e.g., EU's General Data Protection Regulation – GDPR)
  • data holder's responsibility for data use and dissemination
  • user's rights on data use, dissemination, disclosure, correction
– development of technologies for ensuring data protection

Main issue: protecting the anonymity of data owners
– identity disclosure protection (against re-identification)
– attribute disclosure protection (sensitive data)
– inference channel protection (inference, data association)


Privacy of the information

To protect user anonymity, specific norms limit the use of collected data to specific purposes (historical, statistical or scientific), provided that appropriate safeguards are applied.

Safeguards depend on the data release method. Two main data release methods exist:

– macrodata and statistical tables
– microdata

This course will focus on microdata privacy protection.


Data release: macrodata, statistical databases

Main forms of data release in the past:
– macrodata: aggregate information (statistics) on users or organizations, usually in the form of two-dimensional tables
– statistical databases: databases from which only aggregate statistics can be retrieved by users through a DBMS

Different organizations need to make these kinds of data publicly available, e.g.:
– government agencies: historical data (e.g., census data, medical data)
– private organizations: business-related data (e.g., products and sales)

Some examples:
– EUROSTAT (the statistical office of the European Union)
– ISTAT (Italian National Institute of Statistics)


Data release: macrodata, statistical databases

Main protection techniques:
– macrodata: selective obfuscation of sensitive cells
– statistical databases
  • restricting the statistical queries that can be made or the data that can be published
  • returning the user a modified result, either at storage time or at run time
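
The two statistical-database techniques listed above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the toy table, the field names, the threshold `MIN_QUERY_SET_SIZE` and the noise range are hypothetical, and real systems rely on more refined controls.

```python
import random

# Hypothetical toy table; users of a statistical database never see these rows.
rows = [
    {"dept": "A", "salary": 30000},
    {"dept": "A", "salary": 32000},
    {"dept": "A", "salary": 34000},
    {"dept": "B", "salary": 90000},
]

MIN_QUERY_SET_SIZE = 3  # assumed threshold, chosen only for this example


def restricted_mean(table, predicate, field, min_size=MIN_QUERY_SET_SIZE):
    """Query restriction: refuse to answer when too few rows match."""
    values = [r[field] for r in table if predicate(r)]
    if len(values) < min_size:
        return None  # the answer would reveal (almost) individual values
    return sum(values) / len(values)


def perturbed_count(table, predicate, rng, max_noise=1):
    """Run-time modification: add small random noise to the true count."""
    true_count = sum(1 for r in table if predicate(r))
    return max(0, true_count + rng.randint(-max_noise, max_noise))


mean_a = restricted_mean(rows, lambda r: r["dept"] == "A", "salary")
mean_b = restricted_mean(rows, lambda r: r["dept"] == "B", "salary")
noisy_a = perturbed_count(rows, lambda r: r["dept"] == "A", random.Random(0))
print(mean_a, mean_b, noisy_a)
```

The query on department B is refused because it would return the salary of a single respondent, while the count for department A is returned with a small random error.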


Data release: microdata

Nowadays the release of microdata, i.e., data about specific individuals or organizations (respondents), is necessary
– pros: increased flexibility and availability of information to users
– cons: increasing risks of privacy breaches against the anonymity of respondents

Microdata are usually released as two-dimensional tables.

[Table: a toy example for medical data]


Privacy issues in microdata

Basic measure to protect user anonymity: de-identification
– encrypting identifiers
– removing identifiers
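
As a rough illustration of de-identification, the Python sketch below removes explicit identifiers from a toy table and, as an alternative, replaces them with a pseudonym. The schema, the records and the key are hypothetical; the pseudonym is computed with a keyed hash (HMAC), used here as a simple stand-in for "encrypting" the identifier rather than as an actual encryption scheme.

```python
import hashlib
import hmac

IDENTIFIERS = ("ssn", "name")  # explicit identifiers in this hypothetical schema

# Hypothetical toy microdata, invented for illustration.
records = [
    {"ssn": "000-00-0001", "name": "Alice Example", "dob": "1964-09-27",
     "sex": "F", "zip": "94139", "disease": "hypertension"},
    {"ssn": "000-00-0002", "name": "Bob Example", "dob": "1964-04-18",
     "sex": "M", "zip": "94139", "disease": "chest pain"},
]


def remove_identifiers(record):
    """Drop the explicit identifiers and keep every other attribute."""
    return {k: v for k, v in record.items() if k not in IDENTIFIERS}


def pseudonymize(record, secret_key):
    """Replace the identifiers with a keyed-hash pseudonym ('pid')."""
    digest = hmac.new(secret_key, record["ssn"].encode("utf-8"), hashlib.sha256)
    out = remove_identifiers(record)
    out["pid"] = digest.hexdigest()[:12]
    return out


removed = [remove_identifiers(r) for r in records]
pseudonymized = [pseudonymize(r, b"hypothetical-secret") for r in records]
print(removed[0])
print(pseudonymized[0])
```

In both outputs the date of birth, sex and ZIP code are still present unchanged.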


Privacy issues in microdata: data linking

However, de-identification does not guarantee anonymity. Other attributes, named quasi-identifiers (e.g., date of birth, sex, ZIP code), can be linked with external and publicly available information to
– re-identify respondents
– reduce the uncertainty on their identities
– infer sensitive information not intended for disclosure
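
A linking attack of this kind can be sketched in a few lines of Python. Both tables below are hypothetical: `medical` plays the role of a released, de-identified table and `voters` the role of a public list that contains names. The join uses the quasi-identifiers mentioned above.

```python
QUASI_IDENTIFIERS = ("dob", "sex", "zip")

# Hypothetical de-identified microdata (explicit identifiers already removed).
medical = [
    {"dob": "1964-09-27", "sex": "F", "zip": "94139", "disease": "hypertension"},
    {"dob": "1964-09-30", "sex": "F", "zip": "94139", "disease": "obesity"},
    {"dob": "1964-04-18", "sex": "M", "zip": "94139", "disease": "chest pain"},
]

# Hypothetical public list with names and the same quasi-identifiers.
voters = [
    {"name": "Alice Example", "dob": "1964-09-27", "sex": "F", "zip": "94139"},
    {"name": "Bob Example", "dob": "1970-01-02", "sex": "M", "zip": "94141"},
]


def link(released, external, keys=QUASI_IDENTIFIERS):
    """Return (name, disease) pairs for external rows matching exactly one released row."""
    index = {}
    for row in released:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)
    matches = []
    for person in external:
        candidates = index.get(tuple(person[k] for k in keys), [])
        if len(candidates) == 1:  # a unique match means re-identification
            matches.append((person["name"], candidates[0]["disease"]))
    return matches


reidentified = link(medical, voters)
print(reidentified)
```

When several released rows share the same quasi-identifier values the attacker cannot single out one respondent, but the uncertainty about the identity is still reduced to that small group.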


Data linking: toy example


Real-world examples of data linking

• U.S. census data (2000)
• America OnLine (AOL) incident (2006)
• Netflix incident (2006)


U.S. census data (2000)

A study carried out in 2006 showed that a considerable fraction of the U.S. population can be uniquely identified by

– gender

– location (either ZIP code or county)

– date of birth (year, year and month, full date)


P. Golle, Revisiting the Uniqueness of Simple Demographics in the US Population, Proc. WPES’06, pp. 77–80, ACM, 2006. Available at: https://crypto.stanford.edu/~pgolle/papers/census.pdf
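
The kind of measurement behind such a study can be sketched as follows: count how many records have a combination of quasi-identifier values that no other record shares. This is only an illustrative sketch on a hypothetical four-person population, not the methodology of the cited paper; the field names are assumptions.

```python
from collections import Counter

# Hypothetical toy population, invented for illustration.
population = [
    {"sex": "F", "zip": "09123", "birth_year": 1980, "dob": "1980-03-14"},
    {"sex": "F", "zip": "09123", "birth_year": 1980, "dob": "1980-07-02"},
    {"sex": "M", "zip": "09123", "birth_year": 1980, "dob": "1980-03-14"},
    {"sex": "M", "zip": "09124", "birth_year": 1975, "dob": "1975-11-30"},
]


def unique_fraction(rows, keys):
    """Fraction of rows whose values on `keys` are shared with no other row."""
    combos = [tuple(r[k] for k in keys) for r in rows]
    counts = Counter(combos)
    return sum(1 for c in combos if counts[c] == 1) / len(rows)


coarse = unique_fraction(population, ("sex", "zip", "birth_year"))
fine = unique_fraction(population, ("sex", "zip", "dob"))
print(coarse, fine)
```

In this toy population, moving from the year of birth to the full date of birth raises the fraction of uniquely identifiable records from one half to all of them.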


AOL incident (2006)

America OnLine (AOL, an Internet services and media company) released in 2006 around 20 million search records of 650,000 customers for research purposes.

Records were de-identified by replacing personal identifiers with numerical identifiers (ID).

Records contained
– ID
– the term(s) used for the search
– the timestamp
– whether the user clicked on a result, and the corresponding website


AOL incident (2006)

A sample of the data released by AOL, related to user IDs 116874 and 117020:

116874 thompson water seal 2006-05-24 11:31:36 1 http://www.thompsonwaterseal.com
116874 knbt 2006-05-31 07:57:28
116874 knbt.com 2006-05-31 08:09:30 1 http://www.knbt.com
117020 texas penal code 2006-03-03 17:57:38 1 http://www.capitol.state.tx.us
117020 homicide in hook texas 2006-03-08 09:47:35
117020 homicide in bowle county 2006-03-08 09:48:25 6 http://www.tdcj.state.tx.us
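
The sample also shows why a numerical ID is a weak safeguard: it still ties together all the searches of one person. The Python sketch below groups the six sample records by ID. The line layout assumed by the regular expression (ID, query, timestamp, then an optional number and website for clicked results) is inferred from the sample and is an assumption, not a documented format.

```python
import re
from collections import defaultdict

SAMPLE = """\
116874 thompson water seal 2006-05-24 11:31:36 1 http://www.thompsonwaterseal.com
116874 knbt 2006-05-31 07:57:28
116874 knbt.com 2006-05-31 08:09:30 1 http://www.knbt.com
117020 texas penal code 2006-03-03 17:57:38 1 http://www.capitol.state.tx.us
117020 homicide in hook texas 2006-03-08 09:47:35
117020 homicide in bowle county 2006-03-08 09:48:25 6 http://www.tdcj.state.tx.us
"""

# Assumed layout: ID, query, timestamp, optional click number and website.
RECORD = re.compile(
    r"^(?P<id>\d+) (?P<query>.+?) "
    r"(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})"
    r"(?: (?P<click>\d+) (?P<url>\S+))?$"
)


def queries_by_user(text):
    """Collect, for every numerical ID, the list of search terms in order."""
    profiles = defaultdict(list)
    for line in text.splitlines():
        match = RECORD.match(line.strip())
        if match:
            profiles[match["id"]].append(match["query"])
    return dict(profiles)


profiles = queries_by_user(SAMPLE)
for user_id, queries in profiles.items():
    print(user_id, queries)
```

Each resulting list is a search profile of one customer, which may contain place names or other clues to that person's identity.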


AOL incident (2006)


Two reporters from the New York Times were able to re-identify the AOL customer with ID 4417749:
• Thelma Arnold
• a 62-year-old widow
• living in Lilburn

The released data were immediately removed.


Netflix incident (2006)

Netflix (an online movie rental service) launched the Netflix Prize competition in 2006, offering $1 million to anyone who could improve its movie recommendation algorithm, which is based on stored customer data.

To this aim, Netflix released 100 million records containing the ratings given by 500,000 users to the movies they had rented.

Records were de-identified by replacing personal identifiers with numerical identifiers.


Netflix incident (2006)


Some researchers were able to de-anonymize the data by comparing it with publicly available ratings on the Internet Movie Database (IMDb).

As an example, a lesbian mother was re-identified, causing the disclosure of her sexual orientation.

A planned follow-up contest was canceled in 2010 after a privacy lawsuit.
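
The idea of comparing an anonymized ratings table with public ratings can be sketched as below. This is a deliberately simplified overlap score on hypothetical data, not the algorithm used by the researchers; the user IDs, movie titles, ratings and the `tolerance` parameter are all assumptions.

```python
# Hypothetical anonymized profiles: numerical ID -> {movie: rating}.
anonymous_profiles = {
    1001: {"Movie A": 5, "Movie B": 2, "Movie C": 4},
    1002: {"Movie A": 3, "Movie D": 1},
    1003: {"Movie B": 5, "Movie E": 4},
}

# Hypothetical ratings posted publicly under a real name.
public_ratings = {"Movie A": 5, "Movie B": 3, "Movie C": 4}


def overlap_score(anon_ratings, known_ratings, tolerance=1):
    """Count movies rated in both profiles with ratings within `tolerance`."""
    return sum(
        1
        for movie, rating in known_ratings.items()
        if movie in anon_ratings and abs(anon_ratings[movie] - rating) <= tolerance
    )


def best_match(profiles, known_ratings):
    """Return (best ID, best score, runner-up score) over all profiles."""
    scores = {uid: overlap_score(p, known_ratings) for uid, p in profiles.items()}
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    best_id, best_score = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0
    return best_id, best_score, runner_up


match = best_match(anonymous_profiles, public_ratings)
print(match)
```

A large gap between the best score and the runner-up suggests that the public profile corresponds to a single anonymized record, whose remaining ratings are then disclosed.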


Research participant identification (2013)


A study published in 2013 showed that the identities of people who participate in genetic research studies can be discovered by cross-referencing their data with publicly available information: https://www.nature.com/news/privacy-protections-the-genome-hacker-1.12940


Relevant sources of microdata: an example

Statistical institutes:
– EUROSTAT – The statistical office of the European Union
  • https://ec.europa.eu/eurostat/
  • information on available microdata: https://ec.europa.eu/eurostat/web/microdata/public-microdata
– ISTAT – Istituto Nazionale di Statistica (Italian National Institute of Statistics)
  • https://www.istat.it – https://www.istat.it/en
  • information on available microdata: https://www.istat.it/en/analysis-and-products/microdata-files
