Post on 01-Jun-2022
Chair of Software Engineering for Business Information Systems (sebis)
Faculty of Informatics
Technische Universität München
wwwmatthes.in.tum.de
Investigating the Application of Differential Privacy to Mitigate Privacy Issues in Natural Language Processing
Stephen Meisenbacher, 07.12.2020, Guided Research Kick-Off Presentation
Outline
Introduction
Background
Motivation
Goals
Research Questions
Methodology
Initial Results
Source Collection
Interviews
Next Steps
Timeline
© sebis 201207 Meisenbacher Guided Research Kick-Off Presentation
Background #1 – How do People View Data Privacy?
https://www.pewresearch.org/internet/2019/11/15/americans-and-privacy-concerned-confused-and-feeling-lack-of-control-over-their-personal-information/
Background #2 – The Rise of Big Data
https://www.forbes.com/sites/louiscolumbus/2018/05/23/10-charts-that-will-change-your-perspective-of-big-datas-growth/
https://link.springer.com/article/10.1007/s11192-020-03371-2
Evolutionary trend in the number of publications covering data science and big data
Background #3 – Data Breaches and the Reaction
https://www.statista.com/statistics/273550/data-breaches-recorded-in-the-united-states-by-number-of-breaches-and-records-exposed/
Background #4 – Privacy Going Forward
“Data Privacy Will Be The Most Important Issue In The Next Decade”
https://www.forbes.com/sites/marymeehan/2019/11/26/data-privacy-will-be-the-most-important-issue-in-the-next-decade/
https://www.pewresearch.org/internet/2019/11/15/americans-and-privacy-concerned-confused-and-feeling-lack-of-control-over-their-personal-information/
Publications with “Privacy” in Title or Abstract, 2000-2020
https://app.dimensions.ai/analytics/publication/overview/timeline?search_mode=content&search_text=privacy&search_type=kws&search_field=text_search&year_from=2000&year_to=2020&local:indicator-y1=timeline-source-published
Motivation – Differential Privacy
• Context: using data in learning tasks
• Differential Privacy key concept: privacy is preserved even when an individual's data is changed
    i.e. no new information can be learned from these minute changes
    Participation of an individual cannot be determined
• Goal: the difference between the outputs of a query/algorithm with and without a single change is bounded
    Bound can be controlled / quantified (ε)
• Not an algorithm – more of a guideline, schematic
• Provides a privacy guarantee
• Advantages: composability, robustness regardless of attack type
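
As an illustration of the bounded-difference idea above: a randomized mechanism M is ε-differentially private if, for any two datasets D and D' that differ in a single record and any set of outputs S, Pr[M(D) ∈ S] ≤ e^ε · Pr[M(D') ∈ S]. The sketch below is not taken from the slides; it shows one standard way to satisfy this bound, the Laplace mechanism applied to a counting query. The function names, the example messages, and the chosen ε are hypothetical and serve only as illustration.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def private_count(records, predicate, epsilon: float, rng: random.Random) -> float:
    """Release a noisy count of the records that match `predicate`.

    Adding or removing one record changes a count by at most 1
    (sensitivity 1), so Laplace noise with scale 1/epsilon is enough
    for epsilon-differential privacy.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


if __name__ == "__main__":
    # Hypothetical toy dataset: one short text per user.
    messages = [
        "my diagnosis came back today",
        "see you at lunch",
        "the diagnosis was wrong",
        "running late, sorry",
    ]
    rng = random.Random()
    noisy = private_count(messages, lambda m: "diagnosis" in m, epsilon=0.5, rng=rng)
    print(f"noisy count of messages mentioning 'diagnosis': {noisy:.2f}")
```

A smaller ε means a larger noise scale and therefore stronger privacy but a less accurate answer. Under basic sequential composition, answering k such queries with ε each consumes a total privacy budget of k·ε.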
Motivation – Why Differential Privacy + Natural Language Processing?
• Privacy in general is a hot topic
    Coincides with the increasing importance of data
    What if we could provide some sort of privacy guarantee?
• Result: novel privacy-preserving techniques
• Differential privacy: a promising and relatively new concept
    Still very much in the theoretical phase
• NLP: usually dependent on user (human) data → potential privacy concerns
• Many papers address Differential Privacy with regard to Machine Learning or Deep Learning
    But very little is specific to NLP
Goals
• Want: overview of the current state of DP in NLP
    • Privacy vulnerabilities
    • Feasibility
    • Technical applications
    • Use cases?
    • Pros, cons
    • Overall: current work + potential
• Method: systematic literature review
    • Academic research/literature
    • “Grey” literature
    • Contact with experts in the field
Research Questions
1. Which vulnerabilities of current NLP techniques is Differential Privacy capable of preventing?
2. What are the foundations of Differential Privacy, and how can it be applied to NLP tasks?
3. What are the distinct benefits and limitations of applying Differential Privacy to NLP tasks?
Methodology
Initial Results – Source Collection
• Initial search strings:
• ‘Differential Privacy (Natural Language Processing | NLP)’
• ‘Privacy (Natural Language Processing | NLP)’
• ‘Differential Privacy’
• Electronic Data Sources
• IEEE Xplore
• ACM Digital Library
• Google Scholar
• ScienceDirect
• Springer
• Wiley
• Google (for grey literature)
Garousi, et al. “Multivocal Literature Reviews”
Initial Results – Source Collection (cont.)
• 100s: search results (unfiltered)
• ~20: filtered based upon title/abstract
• 60: manual search through references in initial documents
• ??: final set of documents (tbd)
Initial Results – Source Collection (cont.)
Multi-word Keyword Extraction
app.sketchengine.eu
~340,000 word corpus
[Figure: Documents in Catalog by Year. x-axis: Year (2000 to 2020); y-axis: # Elements (0 to 20)]
Initial Results – Interviews: Organization and Methodology
• Manual search for expert contacts
    Look for “privacy” + “NLP” in research interests
• Contacted via email
    8 emails sent
• So far:
    5 responses
    3 interviews scheduled
    2 interviews conducted
        Both with researchers who deal specifically with privacy and NLP
    Hope: a few more interviews over time
• Format: ~30 minute video interview
• Prepared interview questionnaire, broken down into 4 categories:
    General: current work, background with privacy, thoughts on privacy + NLP
    RQ1: NLP privacy vulnerabilities, attack types, preventative work so far
    RQ2: DP foundations, application to NLP, use cases, technical implementations
    RQ3: major advantages, current limitations, future improvement, thoughts on future of private NLP
    Total: 18 questions + sub-questions
• Main goal: obtain a strong background/motivation before diving into the literature review
Initial Results – Interviews: (Some) Takeaways So Far
1. The nature of NLP inherently creates privacy concerns
    Reliant on human data: language is our main source of communication
    Textual information is “rich in content” (potentially private/sensitive)
    Vulnerable to certain attack types:
        Membership inference, attribute inference, keyword inference, pattern reconstruction
    Several use cases bolster this point:
        Models trained on private messages can unintentionally memorize sensitive data [1]
        Profiling fake news / hate speech spreaders from stylometry [2]
        Authorship verification / profiling / clustering (e.g. from follower tweets) [2]
        Hyperpartisan news detection [2]
        *Gender identification from tweets [2,3,6]
        *Personally identifiable search log text [4]
        *Genome prediction using pattern reconstruction [5]
        *Extracting disease keywords from BERT embeddings [5]
    * (Metric) Differential Privacy can be used to mitigate vulnerabilities
    RQ1 goal
[1] Carlini, et al. “The Secret Sharer: Evaluating and Testing Unintended Memorization in Neural Networks”
[2] https://pan.webis.de/publications.html
[3] Koppel, et al. “Automatically Categorizing Written Texts by Author Gender”
[4] Li, et al. “Towards Robust and Privacy-preserving Text Representations”
[5] Pan, et al. “Privacy Risks of General-Purpose Language Models”
[6] Fernandes, et al. “Author Obfuscation using Generalised Differential Privacy”
Initial Results – Interviews: (Some) Takeaways So Far (cont.)
2. Features of DP make it an attractive option (“the best”)
    Privacy guarantee (see 4.)
    Relatively efficient (e.g. vs. homomorphic encryption)
    Composability + robustness
3. However, DP is not a silver bullet for addressing privacy concerns in NLP
    Must consider the task at hand
    Also consider the privacy-utility tradeoff
    Use DP: making generalizations/predictions on publicly visible data
    Other privacy-preserving techniques might sometimes make more sense
4. Biggest challenge/limitation: Explainability
    + Giving an epsilon value to an engineer is simple
    - NLP is fuzzy, unstructured
    - What does privacy mean for NLP (or in general)?
    How can we explain DP in the frame of NLP?
Initial Results – Next Steps
1. Conduct next interviews
    Compile and transcribe results
    Possible: find new contacts
2. Finalize catalog of literature
    Filter down to << 60
3. Begin full literature review
    Mark up and take notes
    Follow the Thematic Synthesis Process (see Appendix D)
4. Concurrently, begin writing
Dyba, et al. “Applying Systematic Reviews to Diverse Study Types: An Experience Report”
Timeline
Stephen Meisenbacher Guided Research Schedule
Project Start Date: 10/1/2020 (Thursday)
Project End Date: 4/12/2021 (Monday)

WBS  TASK                                         START         END           DAYS  % DONE  WORK DAYS
1    Initial Stages                               Fri 10/02/20  Fri 10/30/20  29            21
1.1  Initial meeting with Alexandra               Fri 10/02/20  Fri 10/02/20  1     100%    1
1.2  Initial planning + meeting prep              Sat 10/03/20  Fri 10/16/20  14    100%    10
1.3  Prep meeting, review and finish slides       Mon 10/19/20  Wed 10/21/20  3     100%    3
1.4  Meeting with Prof. Matthes                   Thu 10/22/20  Thu 10/22/20  1     100%    1
1.5  Revision to methodology, etc.                Fri 10/23/20  Fri 10/30/20  8     100%    6
2    Data Collection                              Mon 11/02/20  Mon 12/21/20  50            36
2.1  Source collection, cataloging                Mon 11/02/20  Fri 11/20/20  19    100%    10
2.2  Source review, data collection and notation  Mon 11/16/20  Mon 12/21/20  36    0%      26
2.3  Seminar initial presentation                 Mon 12/07/20  Mon 12/07/20  1     0%      1
2.4  Conduct interviews                           Mon 11/16/20  Fri 12/11/20  26    0%      20
3    Writing and Presentation                     Mon 1/04/21   Mon 4/12/21   99            71
3.1  Writing of review draft                      Mon 1/04/21   Fri 2/12/21   40    0%      30
3.2  Editing and revisions                        Mon 2/15/21   Fri 2/26/21   12    0%      10
3.3  Final presentation prep                      Mon 3/01/21   Fri 3/05/21   5     0%      5
3.4  Final presentation                           Mon 3/08/21   Mon 3/08/21   1     0%      1
3.5  Submit final paper                           Fri 3/12/21   Fri 3/12/21   1     0%      1
3.6  Buffer                                       Sat 3/13/21   Mon 4/12/21   31    0%      21
Stephen Meisenbacher
stephen.meisenbacher@tum.de
Technische Universität München
Faculty of Informatics
Chair of Software Engineering for Business Information Systems
Boltzmannstraße 3
85748 Garching bei München
Tel +49.89.289.17132
Fax +49.89.289.17136
wwwmatthes.in.tum.de
matthes@in.tum.de
Appendix A: Grey Literature (Garousi)
Appendix B: Quality Checklist (Garousi)
Appendix C: Search Process Documentation (Kitchenham)
Appendix D: Thematic Synthesis Process (Dyba)