Post on 22-Oct-2021
Data Anonymization
Sara Szoc, CrossLang Workshop
Introduction
Data Anonymization
• Concept
• Methods
• Risks
• Practical tips
What is data anonymization
What?
• Process of removing private or confidential information from raw data
• Results in anonymous data that cannot be associated with any individual or company
Why?
• Protection of identity and private activities
• Financial aspect
How?
• Using anonymization technique(s)
• Selection and assessment based on use case
Personal Data
Personal or identifiable data:
Information that can lead to the identification of an individual (or a group of individuals)
• Direct identifiers: person/company name, surname, email address containing a name, phone number, ID card/social security number, medical record number …
• Indirect identifiers: date of birth, gender, and zip code together can uniquely identify about 80% of the US population
• Pseudonymous or encrypted data: can be used to re-identify a person and thus remains personal data
Personal Data
“Personal data that has been rendered anonymous in such a way that the individual is not or no longer identifiable is no longer considered personal data.
For data to be truly anonymised, the anonymisation must be irreversible.”
(source: General Data Protection Regulation)
Sensitive Data
• Sensitive personal data
  • can cause harm or embarrassment to the individual
  • for limited dissemination only
  • e.g. racial/ethnic origin, political/religious beliefs, genetic data, biometric data (fingerprints), health information, sexual orientation … (GDPR)
• Sensitive business information
  • poses a risk to the company in question if discovered
  • e.g. trade secrets, acquisition plans, financial data, supplier and customer information
Structured versus unstructured data
• Structured data
  • stored in a structured way
  • easily searchable
  • relational databases, spreadsheets, data in formats such as JSON, XML, CSV …
• Unstructured data
  • anything else
  • difficult to search
  • text files, reports, email messages, audio files, images …
Anonymization methods: suppression and masking
[Figure: example data before and after anonymization]
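Suppression and masking can be sketched in a few lines of Python. This is a minimal illustration, not a production routine: the field names, the phone numbers, and the helpers `suppress` and `mask` are made up for this example.

```python
from typing import Dict, List

Record = Dict[str, str]


def suppress(records: List[Record], fields: List[str]) -> List[Record]:
    """Suppression: remove the listed fields entirely from every record."""
    return [{k: v for k, v in r.items() if k not in fields} for r in records]


def mask(value: str, visible: int = 2, fill: str = "x") -> str:
    """Masking: keep the first `visible` characters and hide the rest."""
    return value[:visible] + fill * max(len(value) - visible, 0)


# Hypothetical sample records (invented values).
records = [
    {"name": "John", "phone": "0471234567", "illness": "Flu"},
    {"name": "Ashley", "phone": "0479876543", "illness": "Multiple Sclerosis"},
]

# Drop the direct identifier, then hide most of the phone number.
without_names = suppress(records, ["name"])
masked = [{**r, "phone": mask(r["phone"], visible=3)} for r in without_names]
print(masked)
```

Suppression removes a value completely, while masking keeps its shape. A masked value can still leak information (here, the phone prefix), so how many characters stay visible depends on the use case.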
Anonymization methods: classification
[Figure: example data before and after anonymization]
Anonymization methods: perturbation, swapping, generalization
Before anonymization:
Name    Age  Location  Illness
John    40   Brussels  Flu
Ashley  56   Antwerp   Multiple Sclerosis
Luke    80   Berlin    Lung cancer
Roman   71   Munich    Multiple Sclerosis
After anonymization (names swapped, ages perturbed, locations generalized):
Name    Age  Location  Illness
Luke    39   Belgium   Flu
Ashley  57   Belgium   Multiple Sclerosis
John    81   Germany   Lung cancer
Roman   72   Germany   Multiple Sclerosis
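The three methods on this slide can be sketched as follows. The sketch assumes the table with city names is the original data (city names written in English) and that generalization maps a city to its country. The helper names, the noise range, and the `CITY_TO_COUNTRY` lookup are illustrative assumptions, not part of any specific tool.

```python
import random
from typing import Dict, List, Union

Row = Dict[str, Union[str, int]]

# Assumed lookup used for generalization (city -> country).
CITY_TO_COUNTRY = {
    "Brussels": "Belgium",
    "Antwerp": "Belgium",
    "Berlin": "Germany",
    "Munich": "Germany",
}


def perturb_age(age: int, rng: random.Random, max_noise: int = 1) -> int:
    """Perturbation: add a small random offset to a numeric value."""
    return age + rng.randint(-max_noise, max_noise)


def swap_names(rows: List[Row], rng: random.Random) -> List[Row]:
    """Swapping: shuffle one attribute across records."""
    names = [r["name"] for r in rows]
    rng.shuffle(names)
    return [{**r, "name": n} for r, n in zip(rows, names)]


def generalize_location(city: str) -> str:
    """Generalization: replace a precise value with a broader category."""
    return CITY_TO_COUNTRY.get(city, "Unknown")


rows: List[Row] = [
    {"name": "John", "age": 40, "location": "Brussels", "illness": "Flu"},
    {"name": "Ashley", "age": 56, "location": "Antwerp", "illness": "Multiple Sclerosis"},
    {"name": "Luke", "age": 80, "location": "Berlin", "illness": "Lung cancer"},
    {"name": "Roman", "age": 71, "location": "Munich", "illness": "Multiple Sclerosis"},
]

rng = random.Random(0)  # fixed seed only to make the sketch repeatable
anonymized = [
    {**r, "age": perturb_age(r["age"], rng), "location": generalize_location(r["location"])}
    for r in swap_names(rows, rng)
]
for row in anonymized:
    print(row)
```

Each method trades utility for privacy in a different way: perturbation keeps ages roughly right, swapping keeps the set of names but breaks the link to a record, and generalization keeps locations true but coarser.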
Pseudonymization
• Reversible process by using a key
• Must still be treated as personal data because it enables re-identification
Name    Pseudonymized  Anonymized
John    q0fdGL         xxxxx
Ashley  s8fhPd         xxxxx
Luke    EiuD5j         xxxxx
Roman   qOerd          xxxxx
Luke    EiuD5j         xxxxx
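The difference between the two columns can be sketched in Python. This is one possible reading of "reversible by using a key": the key is modeled here as a secret lookup table, which is an assumption of this example. The `Pseudonymizer` class and `anonymize` function are hypothetical names.

```python
import secrets
import string
from typing import Dict

ALPHABET = string.ascii_letters + string.digits


class Pseudonymizer:
    """Replaces names with random tokens and keeps the mapping as the secret key."""

    def __init__(self) -> None:
        self._forward: Dict[str, str] = {}
        self._reverse: Dict[str, str] = {}

    def _new_token(self, length: int = 6) -> str:
        # Draw random tokens until one is unused, so tokens stay unique.
        while True:
            token = "".join(secrets.choice(ALPHABET) for _ in range(length))
            if token not in self._reverse:
                return token

    def pseudonymize(self, name: str) -> str:
        # The same name always maps to the same token.
        if name not in self._forward:
            token = self._new_token()
            self._forward[name] = token
            self._reverse[token] = name
        return self._forward[name]

    def reidentify(self, token: str) -> str:
        # Anyone holding the mapping can reverse the process.
        return self._reverse[token]


def anonymize(_name: str) -> str:
    """Irreversible: every name collapses to the same placeholder."""
    return "xxxxx"


p = Pseudonymizer()
token = p.pseudonymize("Luke")
print(token, p.reidentify(token), anonymize("Luke"))
```

Because the mapping exists and repeated names receive the same token, records stay linkable and re-identifiable, which is why pseudonymized data remains personal data.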
Measuring anonymization and risks
• K-anonymity, Differential privacy
• Focus on structured data
Gender  Age    Location  Illness
male    40-50  Belgium   Flu
male    40-50  Belgium   Multiple Sclerosis
female  >50    Germany   Lung cancer
female  >50    Germany   Multiple Sclerosis
2-anonymous data
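As a rough check of the table above, k can be computed as the size of the smallest group of records sharing the same quasi-identifier values. The sketch assumes Gender, Age, and Location are the quasi-identifiers and Illness is the sensitive attribute. The function name `k_anonymity` is made up for this example.

```python
from collections import Counter
from typing import Dict, List, Sequence


def k_anonymity(rows: List[Dict[str, str]], quasi_identifiers: Sequence[str]) -> int:
    """Return the size of the smallest group sharing the same quasi-identifier values."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in rows)
    return min(groups.values()) if groups else 0


rows = [
    {"gender": "male", "age": "40-50", "location": "Belgium", "illness": "Flu"},
    {"gender": "male", "age": "40-50", "location": "Belgium", "illness": "Multiple Sclerosis"},
    {"gender": "female", "age": ">50", "location": "Germany", "illness": "Lung cancer"},
    {"gender": "female", "age": ">50", "location": "Germany", "illness": "Multiple Sclerosis"},
]

k = k_anonymity(rows, ["gender", "age", "location"])
print(k)  # prints 2
```

Every combination of quasi-identifier values occurs at least twice, so no record can be singled out on those attributes alone.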
Existing tools
• Tools for structured data
  • ARX
  • Cornell Anonymization Toolkit
• Tools for unstructured data
  • MITRE Identification Scrubber Toolkit (MIST)
  • Natural language processing tools (e.g. OpenNLP or Stanford CoreNLP named entity recognizers)
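For unstructured text, a very rough idea of what scrubbing involves can be given with regular expressions. This sketch does not use the API of any tool listed above. The patterns, the placeholder labels, and the sample sentence are assumptions for illustration only.

```python
import re

# Simplified patterns: they catch common formats, not every valid address or number.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s./-]{7,}\d")


def scrub(text: str) -> str:
    """Replace email addresses and phone-like number sequences with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text


sample = "Contact John at john.doe@example.com or +32 471 23 45 67."
print(scrub(sample))  # prints: Contact John at [EMAIL] or [PHONE].
```

The name "John" is left in place: pattern matching only finds identifiers with a predictable shape, which is why names and locations in free text usually call for a named entity recognizer.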
Practical tips (conclusions)
There is no “one-size-fits-all” solution; several factors need to be taken into consideration:
• Analyze nature of data
• Analyze recipients
• Analyze risks (de-anonymization risk management)
• Analyze data utility
• Run the anonymization process inside the organization