Data Anonymizing API – Phase II · Amongst other things Paul is the Global Ambassador for the...

TM Forum Digital Transformation World 2018 Nice, France 14-16 May Data Anonymizing API – Phase II


Page 1: Data Anonymizing API – Phase II · Amongst other things Paul is the Global Ambassador for the TMForum with responsibility for Big Data Analytics and Customer Experience Management.

© 2018 TM Forum | 1



Catalyst Champion & Participants

Orange, Emilie Sirvent-Hien, Anonymization project manager and Sophie Nachman, Standards Manager

A standardized anonymization API allows data to be shared internally within Orange and externally with partners, unleashing service innovation while guaranteeing customer privacy in compliance with the General Data Protection Regulation (GDPR).

Vodafone, Atul Ruparelia, Data Architect, and Imo Ekong, Big Data Communication Specialist (Analytics CoE)

Contribute towards a standard open API for anonymization/pseudonymization (using rich TMF assets) that allows data sharing with internal and external partners to drive service innovation while protecting PII data in compliance with GDPR.

Cardinality, Steve Bowker, CEO & Co-Founder, and Dejan Vujic, Head of Data Science

Cardinality have implemented one of the largest Hadoop-based analytics solutions at a telco in Europe. It uses a containerised, microservices-based architecture that makes extensive use of APIs, including different data anonymization, pseudonymization and encryption methods within the solution.

Brytlyt, Richard Heyns, CEO & Founder

Brytlyt leverage advanced processing on GPUs using natively parallelizable algorithms, which form the foundation of their high-performance data analytics and machine learning.

Liverpool John Moores University, Professor Paul Morrissey

Amongst other things, Paul is the Global Ambassador for the TM Forum with responsibility for Big Data Analytics and Customer Experience Management. Provided input on business drivers, CurateFX and the Osterwalder Business Canvas. Data Science and Software Programmers

5G Innovation Centre (5GIC), hosted by the University of Surrey

5GIC members are collaborating closely to drive forward advanced wireless research, reduce the risks of implementing 5G (through their 5G testbed) and contribute to global 5G standardisation. The 5GIC has to date attracted an additional £68m from industry and regional partners.


Catalyst Background

Picking up where we left off: Data Anonymization Phase I Catalyst

Mobile network operators are increasingly using third parties to help deliver their overall solution offerings.

As service providers, they cannot share data or allow access to their systems without data protection (for example, data anonymization). They have to protect the privacy of their subscribers and comply with a demanding regulatory environment, in which new mandatory regulations require personal data to be kept private.


Data Anonymization - Marketing Campaign

Example Use Case

Customer data may contain personally identifiable information (PII) such as name, address and date of birth. A marketing campaign may use specific data to recommend new products or services to customers. To do so, data is retrieved from the source and transferred to an external data repository or to a third-party marketing company, where it is analysed for insights that support cross-sell and up-sell offers and drive marketing to current or new customers for revenue growth. Vodafone would use third parties to create marketing offers. For legal, privacy and security purposes, personally identifiable information may need to be anonymised before data is transferred to a third party or external database.

[Diagram: Data Sources, Data Anonymization API, Data Repository, Database/Solution Provider Platform. Other labels: Marketing, DataMart, Transaction Data, Web Analytics]


What is Data Anonymization?

A process by which personally identifiable information (PII) is irreversibly altered in such a way that a PII principal can no longer be identified directly or indirectly, either by the PII controller alone or in collaboration with any other party (ISO 29100:2011)

In accordance with the Article 29 Working Party (G29) Opinion*, there are Three Main Principles:
• Singling out: Is it possible to isolate someone in particular?
• Linkability: Is it possible to link at least two records concerning the same data subject?
• Inference: Is it possible to deduce information about one person?

Once a dataset is truly anonymized, GDPR no longer applies

Anonymization technologies: 3 Technical Families
• Generalization (k-anonymization)
• Randomization (noise addition, Differential Privacy)
• Obfuscation

*http://ec.europa.eu/justice/data-protection/article-29/documentation/opinion-recommendation/files/2014/wp216_en.pdf
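
As an illustration of the randomization family, the following is a minimal sketch of noise addition on a count, using the Laplace mechanism commonly associated with differential privacy. The function names, the epsilon value and the example count are assumptions made for this sketch; they are not taken from the Catalyst prototype.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw one sample from a Laplace(0, scale) distribution."""
    # The difference of two independent Exp(1) draws follows Laplace(0, 1).
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def noisy_count(true_count: int, epsilon: float, rng: random.Random) -> float:
    """Release a count with Laplace noise calibrated to epsilon.

    A counting query has sensitivity 1 (adding or removing one person changes
    the result by at most 1), so the noise scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return true_count + laplace_noise(1.0 / epsilon, rng)


rng = random.Random(2018)  # fixed seed only so the sketch is repeatable
# Hypothetical query: number of subscribers in a segment (true value 1200).
released = [noisy_count(1200, epsilon=0.5, rng=rng) for _ in range(5)]
```

A smaller epsilon gives more noise and therefore more privacy but less utility, which is the trade-off the Catalyst sets out to assess.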


What is Pseudonymization? Data Masking?

Pseudonymization is a data management and de-identification procedure by which personally identifiable information fields within a data record are replaced by one or more artificial identifiers, or pseudonyms. A single pseudonym for each replaced field or collection of replaced fields makes the data record less identifiable while remaining suitable for data analysis and data processing.

Pseudonymized data can be restored to its original state with the addition of information which then allows individuals to be re-identified, while anonymized data can never be restored to its original state.

We can extend pseudonymization with data masking techniques:
• Substitution
• Shuffling
• Number and date variance
• Encryption
• Nulling and deletion
• Scrambling
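
A minimal sketch of four of these masking techniques (substitution, shuffling, date variance and nulling) on a few made-up records. The field names, the sample values and the 30-day shift window are assumptions chosen for illustration only.

```python
import random
from datetime import date, timedelta

# Hypothetical records; field names are illustrative, not from the Catalyst.
records = [
    {"name": "Alice Martin", "city": "Nice", "birth_date": date(1980, 5, 14), "email": "alice@example.com"},
    {"name": "Bob Durand", "city": "Lyon", "birth_date": date(1975, 1, 2), "email": "bob@example.com"},
    {"name": "Chloe Petit", "city": "Paris", "birth_date": date(1990, 11, 30), "email": "chloe@example.com"},
]


def mask(rows, rng, max_shift_days=30):
    """Return masked copies of rows; the input is left unchanged."""
    # Shuffling: permute one column so values no longer line up with their row.
    cities = [row["city"] for row in rows]
    rng.shuffle(cities)
    masked = []
    for index, row in enumerate(rows):
        shift = rng.randint(-max_shift_days, max_shift_days)
        masked.append({
            "name": f"Subscriber-{index + 1:04d}",                    # substitution
            "city": cities[index],                                    # shuffling
            "birth_date": row["birth_date"] + timedelta(days=shift),  # date variance
            "email": None,                                            # nulling
        })
    return masked


masked_records = mask(records, random.Random(7))
```

Masking of this kind lowers identifiability but does not by itself make the data anonymous: combinations of the remaining values may still single out a person.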

When is it used to share data?
• Inside a company, as soon as possible
• With a trusted partner (e.g. Vodafone – BT Openreach)
• Pseudonymised data are still personal data, but less sensitive


Privacy by design methodology

Qualify the use case
• What type of data? Collected on which basis? Who has access to it?
• What do you want to do with protected data?
• Do you need a reversible step?
• What are the re-identification risks?

Find the trade-off between utility and privacy
• Choose the right technology
• Validate with a privacy risk assessment


Catalyst Objective

OBJECTIVE: Build a prototype and develop a new TM Forum Open API that will allow service providers to protect sensitive data so it can be shared with third parties without compromising privacy, while meeting new regulatory requirements such as GDPR.

Technical Objective:
• Building upon industry-agreed standards enabling a scalable platform, improving interoperability, efficiency and transparency
• Investigating various methods of anonymizing and protecting data for use in the API
• Prototyping an initial API for providing anonymized data to third parties without compromising the privacy of the data content
• Assessing the effectiveness of the various anonymization approaches that could be used, and the impact on the "value" of the data shared if data is protected 100%
• Exploring ways to support a 2-way Open API for anonymization (maybe pseudonymization / data protection)

Business Objective:
• Exploring new business models for Telco Data-as-a-Service
• Exploring potential uses outside the Telco domain, e.g. for banking services, media content providers etc.


Use Cases

Data Protection for different needs

• Operational usage: processing personal data for non-production purposes, such as samples needed for testing, training, or sharing with internal or external suppliers or with employees outside Europe

• Marketing usage: customer knowledge improvement or new services development

• Business usage: data monetization with external customers

• GDPR right to be forgotten

• Artificial Intelligence platform sharing or AI algorithm model training, for example 5G location and content prediction algorithms

• Data publishing, for example scientific studies and open data


Use Case Example 1: Offer content

As A Third Party Marketing Company (Exec)

I Need To Offer Content (film/TV and VoD) to a specific group of subscribers in a specific Geographic Market based on various criteria

So That I Can Increase Content Usage and Increase Revenue for our customers

To Do This, I Need To

Access data from the Telco's customer base (subscribers) to understand which subscribers (based on various criteria) are appropriate to target. I need this capability to be provided under the auspices of national data governance rules and regulations, to guarantee the anonymisation of the Telco subscribers' data

I Know I Am Successful When

There is an increase in sales of the content to a point where the service is profitable and is accepted by the Telco Governance and Security policies


Business Canvas - To drive revenue growth and to unleash offers/products (eTOM terminology)


Data Flow

Data Originated

Data Pseudonymized

Data Analyzed via Machine Learning

Data Deanonymized
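
The four stages above can be sketched as follows. This is a minimal illustration under stated assumptions: the identifiers and genres are invented, random tokens stand in for whatever pseudonymization the prototype uses, and a simple frequency count stands in for the machine-learning step.

```python
import secrets
from collections import Counter

# Hypothetical originated data: (subscriber identifier, content genre watched).
originated = [
    ("33611111111", "sport"),
    ("33622222222", "film"),
    ("33611111111", "sport"),
    ("33633333333", "film"),
    ("33611111111", "film"),
]


def pseudonymize(rows):
    """Replace each identifier with a random token; return rows and the key table."""
    token_for = {}
    out = []
    for identifier, genre in rows:
        if identifier not in token_for:
            token_for[identifier] = secrets.token_hex(8)
        out.append((token_for[identifier], genre))
    # The key table (token -> identifier) stays with the data owner.
    key_table = {token: identifier for identifier, token in token_for.items()}
    return out, key_table


def most_active(rows):
    """Stand-in for the analysis step: find the token with the most events."""
    return Counter(token for token, _ in rows).most_common(1)[0][0]


pseudonymized, key_table = pseudonymize(originated)
winner_token = most_active(pseudonymized)  # the third party sees tokens only
winner = key_table[winner_token]           # re-identification at the source
```

The analysis result is expressed in tokens, so only the party holding the key table can map it back to a subscriber.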


Anonymization API (see Demo at Catalyst stand level 2)


CATALYST Data for Demo

The demo needed to be based on publicly available data, so we selected the ADULT data set from the University of California, Irvine – Machine Learning Repository


Enriched UCI “Adult” dataset used (extra synthetic Telco Fields)

• Based on UCI “Adult” dataset, enhanced with Telco Fields, e.g. IMEI, IMSI etc.

INPUT FILE to API

OUTPUT FILE after processing


Example Anonymization using “K-Anonymization”

• AGE data is binned into ranges of 10 years

INPUT FILE to API

OUTPUT FILE after processing
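
A minimal sketch of this kind of generalization: ages are binned into 10-year ranges, and the size of the smallest group sharing the same quasi-identifier values is measured before and after. The rows and the choice of `age` and `sex` as quasi-identifiers are assumptions for illustration; they are not the demo's actual files.

```python
from collections import Counter


def bin_age(age: int, width: int = 10) -> str:
    """Generalize an exact age into a range such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def smallest_group(rows, quasi_identifiers):
    """Size of the smallest group of rows sharing the same quasi-identifier values."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values())


# Hypothetical rows in the spirit of the UCI "Adult" columns.
rows = [
    {"age": 39, "sex": "Male", "income": "<=50K"},
    {"age": 31, "sex": "Male", "income": ">50K"},
    {"age": 37, "sex": "Male", "income": "<=50K"},
    {"age": 28, "sex": "Female", "income": "<=50K"},
    {"age": 23, "sex": "Female", "income": "<=50K"},
    {"age": 21, "sex": "Female", "income": ">50K"},
]

generalized = [{**row, "age": bin_age(row["age"])} for row in rows]

k_before = smallest_group(rows, ["age", "sex"])        # every exact age is unique here
k_after = smallest_group(generalized, ["age", "sex"])  # groups: 30-39/Male, 20-29/Female
```

Binning does not guarantee a given k on its own; the smallest group size has to be checked on the output, and ranges widened or rows suppressed if it falls below the target.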


Example of processing of data using “Discretization”

• Human-readable information mapped to discrete values on sensitive fields

INPUT FILE to API

OUTPUT FILE after processing
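
One possible reading of this step, sketched below, is to replace readable category labels with integer codes and keep the codebook separately. The field names and values are assumptions in the style of the UCI "Adult" columns, and the coding scheme (codes in order of first appearance) is a choice made for this sketch.

```python
def build_codebook(values):
    """Assign an integer code to each distinct value, in order of first appearance."""
    codebook = {}
    for value in values:
        codebook.setdefault(value, len(codebook))
    return codebook


def discretize(rows, fields):
    """Replace readable values in the given fields with integer codes."""
    codebooks = {f: build_codebook(row[f] for row in rows) for f in fields}
    coded = [
        {k: (codebooks[k][v] if k in codebooks else v) for k, v in row.items()}
        for row in rows
    ]
    return coded, codebooks


# Hypothetical rows; only the listed fields are treated as sensitive.
rows = [
    {"occupation": "Tech-support", "marital_status": "Married", "hours": 40},
    {"occupation": "Sales", "marital_status": "Never-married", "hours": 35},
    {"occupation": "Tech-support", "marital_status": "Divorced", "hours": 50},
]

coded, codebooks = discretize(rows, ["occupation", "marital_status"])
```

The coded file keeps the statistical structure of the fields, so analysis still works, while the codebook that gives the codes their meaning can stay with the data owner.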


Example Pseudonymization using one-way random algorithm

• IMEI data processed with a 1-way algorithm and unique IMEIs

INPUT FILE to API

OUTPUT FILE after processing
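
The slides do not name the algorithm used. As a minimal sketch of a one-way transformation, the example below uses HMAC-SHA256 with a randomly generated secret key; this choice, the function name and the sample IMEIs are assumptions for illustration.

```python
import hashlib
import hmac
import secrets


def pseudonymize_imei(imei: str, secret_key: bytes) -> str:
    """Return a keyed one-way digest of an IMEI (HMAC-SHA256, hex encoded)."""
    return hmac.new(secret_key, imei.encode("ascii"), hashlib.sha256).hexdigest()


secret_key = secrets.token_bytes(32)  # random key, kept by the data owner

# Hypothetical input column; the first and last values are the same device.
imeis = ["490154203237518", "356938035643809", "490154203237518"]
pseudonyms = [pseudonymize_imei(imei, secret_key) for imei in imeis]
```

The same IMEI always maps to the same pseudonym under a given key, so records can still be joined per device. A keyed digest is used because an IMEI has only 15 digits, and a plain unkeyed hash could be reversed by trying every possible value.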


Semi Autonomous – Anonymization / Pseudonymization

1 Columnar Field Analysis
• Identify properties of each field, e.g. uniqueness, specific anonymization requirements for locations, identity fields, phone numbers etc.

2 Cross Column Correlation, Redundancy Reduction
• Frequency distributions, k-means analysis of combinations of columns
• Identification of Personally Identifiable Information (PII) combinations
• Removal of redundant information

3 Data Noise Injection / Synthetic Data Generation
• Injection of additional synthetic data to support the anonymization process

4 Anonymization / Pseudonymization Process
• Individual columns are anonymized / pseudonymized in order to meet regulatory requirements

5 Key retrieval only possible on source network (for returned results)
• Data is held only temporarily in the API and will time out (plus be deleted where credentials are invalid)
• Only the source that sent the original data can retrieve keys, with a certificate / location-aware process
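
Steps 1 and 2 can be sketched as follows: a per-column uniqueness ratio, and a frequency check on pairs of columns to flag combinations that isolate too few rows. This covers the frequency-distribution part only, not the k-means analysis, and the column names, rows and threshold are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations


def uniqueness(rows, column):
    """Fraction of distinct values in a column (1.0 means every value is unique)."""
    values = [row[column] for row in rows]
    return len(set(values)) / len(values)


def risky_combinations(rows, columns, k):
    """Column pairs for which some combination of values is shared by fewer than k rows."""
    risky = []
    for pair in combinations(columns, 2):
        groups = Counter(tuple(row[c] for c in pair) for row in rows)
        if min(groups.values()) < k:
            risky.append(pair)
    return risky


# Hypothetical rows with a direct identifier (imsi) and three other fields.
rows = [
    {"imsi": "208011111111111", "age_range": "30-39", "postcode": "06000", "sex": "F"},
    {"imsi": "208012222222222", "age_range": "30-39", "postcode": "06000", "sex": "F"},
    {"imsi": "208013333333333", "age_range": "30-39", "postcode": "06000", "sex": "M"},
    {"imsi": "208014444444444", "age_range": "30-39", "postcode": "06000", "sex": "M"},
]

imsi_uniqueness = uniqueness(rows, "imsi")  # fully unique: treat as an identifier
flagged = risky_combinations(rows, ["age_range", "postcode", "sex"], k=3)
```

A column with uniqueness close to 1.0 is a candidate for pseudonymization or removal, while flagged column pairs are candidates for further generalization.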


Use Case Example 2: Data publishing

As A Statistical Institute

I Need To Publish open data on a country's citizens

So That I Can Help the state to build and decide its policy

To Do This, I Need To

Access data from the Telco's customer base (subscribers) to compute my indicators without compromising customers' privacy and in accordance with Data Protection regulation

I Know I Am Successful When

The statistical results from anonymized data are usable enough to compute my indicators


Example available on the UCI website, free to share: https://archive.ics.uci.edu/ml/datasets.html

Qualification

• Data from the CRM systems, collected on a contractual basis
• The operator's marketing team has access to the data to improve its offers and develop new ones, but wants to publish the data without a legal basis for publication
• No reversible step needed
-> Anonymization needed


Re-identification risks mitigation

• Avoid singling out, linking and inference

• Qualify the quality of data for specific usage

• Choose Orange technology*: co-clustering for microdata anonymization to generate synthetic data (with a minimum cluster size K, for example 200)

[Diagram: CSV file, synthetic data generation (K parameter), CSV file with same format]

* https://rd.springer.com/chapter/10.1007/978-3-319-64468-4_26, Tarek Benkhelif, Françoise Fessant, Fabrice Clérot, Guillaume Raschia
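
The sketch below is not the co-clustering method from the cited paper. It is a much simpler stand-in that only illustrates the role of a minimum group size K: records are grouped into cells of identical values, cells with fewer than K records are dropped, and synthetic records are sampled from the remaining cells in proportion to their size. All names, rows and parameter values are assumptions.

```python
import random
from collections import Counter


def synthesize(rows, columns, k, n, rng):
    """Sample n synthetic rows from cells (value combinations) holding at least k rows."""
    cells = Counter(tuple(row[c] for c in columns) for row in rows)
    kept = {cell: count for cell, count in cells.items() if count >= k}
    if not kept:
        raise ValueError("no cell reaches the minimum size k")
    # Draw cells with probability proportional to their record count.
    chosen = rng.choices(list(kept), weights=list(kept.values()), k=n)
    return [dict(zip(columns, cell)) for cell in chosen]


# Hypothetical input: two well-populated cells and one rare cell.
rows = (
    [{"age_range": "30-39", "plan": "prepaid"}] * 5
    + [{"age_range": "40-49", "plan": "postpaid"}] * 4
    + [{"age_range": "70-79", "plan": "postpaid"}] * 1
)

synthetic = synthesize(rows, ["age_range", "plan"], k=3, n=10, rng=random.Random(22))
```

With K set to 3 the rare cell never appears in the output, so no synthetic record can be traced back to a group smaller than K. A larger K, such as the 200 mentioned on the slide, trades more detail for more privacy.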


Example Orange Anonymization Tools

Raw data

Dictionary of data format

Privacy parameter depending on the use case (e.g. 200)

Co-clustering file


Anonymization API: Further Research Areas

Some proposals exist, but no solution suits every usage with zero privacy risk

-> Data Anonymization and Re-identification Competition (PETS 2018)

Anonymization is a process that needs to be kept up to date with the state of the art. Example: Aircloak's query-based system with the Diffix algorithm, and a recent article proposing an attack*

Some use cases need a reversible technology (fraud management, customer care, national security…)

* https://arxiv.org/abs/1804.06752, Andrea Gadotti, Florimond Houssiau, Luc Rocher, Yves-Alexandre de Montjoye


THANK YOU