USABLE PRIVACY POLICY AND PERSONALIZED PRIVACY ASSISTANT PROJECTS
HONG KONG UNIVERSITY EXPERT ADDRESS
Copyright © 2019, Norman Sadeh
What if Computers Understood
Privacy Policies?
A Look at Advances in NLP through
the Lens of Privacy
Norman Sadeh
Carnegie Mellon University
www.normsadeh.org
usableprivacy.org privacyassistant.org
explore.usableprivacy.org
What is Privacy?
• Moral right of individuals to be left
alone, free from surveillance or
interference from other individuals or
organizations, including the state
– There are obviously conflicting
considerations
• e.g. security and safety
• Legal Protection: founding documents of
many countries
Data Privacy
• The claim that certain information should not be collected by governments, businesses, or other entities – or possibly only under special circumstances and subject to various rules
– Hong Kong Personal Data (Privacy) Ordinance
– EU General Data Protection Regulation (GDPR)
– US Children's Online Privacy Protection Act (COPPA)
– etc.
Privacy Policies
• Disclose an entity’s
data practices –
“notice and choice”
• In practice: long,
complex, vague and
ambiguous
• Hardly anyone
reads them
Yet People Care About Privacy
• Vast majority of people care about privacy
• 91% of people in the US report feeling they
have lost control over their information
Pew Survey 2014: http://www.pewinternet.org/2014/11/12/public-privacy-perceptions/
One Size Fits All Doesn’t Work
Attempts to summarize privacy policies have
been shown to have limitations
– Either the summary is too long and people
still don’t read it
– Or the summary is too short and people fail
to understand critical issues
– Different people care about different issues
• Different concerns, different expectations, what they
know/don’t know is different
J. Gluck, F. Schaub, A. Friedman, H. Habib, N. Sadeh, L.F. Cranor, Y. Agarwal, "How Short is Too Short?
Implications of Length and Framing on the Effectiveness of Privacy Notices", Symposium on Usable Privacy
and Security (SOUPS '16), Denver, CO, Jun 2016
What if…
• Computers understood the text of privacy
policies?
Annotation Tool
Select a category
S. Wilson, F. Schaub, A. Dara, F. Liu, S. Cherivirala, P.G. Leon, M.S. Andersen, S. Zimmeck, K. Sathyendra, N.C.
Russell, T.B. Norton, E. Hovy, J.R. Reidenberg, N. Sadeh, "The Creation and Analysis of a Website Privacy Policy
Corpus", ACL '16: Annual Meeting of the Association for Computational Linguistics, Aug 2016
Select an attribute
Select a value
Highlight text span for an attribute,
value pair
OPP-115 Privacy Policy Corpus
Interpreting Annotations
A First Task: Segment Annotation
A machine learning model predicts the categories each policy segment discusses:
Disclosure of Your Information: Sci-News.com
does not sell, trade or rent your personal
information to third parties. If we choose to do
so in the future, you will be notified by email of
our intentions, and have the right to be
removed prior to the disclosure.
This policy segment discusses:
• Third Party Sharing/Collection
A Number of Possible Classifiers
• Traditional Methods
– Bag of N-grams as features
– Multinomial Naïve Bayes (MNB)
– Logistic Regression (LR)
– Support Vector Machines (SVM)
• Neural Methods
– One-hot vector as input
– Recurrent Neural Networks (RNNs)
– Convolutional Neural Networks (CNNs)
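As an illustration of the traditional baselines listed above, here is a minimal multinomial Naive Bayes classifier over bag-of-words counts, written from scratch. The segments, labels, and smoothing settings are invented toy data, not the project's OPP-115 setup.

```python
# Toy multinomial Naive Bayes over bag-of-words counts (a "traditional"
# baseline of the kind listed above). Segments and labels are made up.
import math
from collections import Counter, defaultdict

def train_mnb(segments, labels, alpha=1.0):
    """Fit per-class log-priors and Laplace-smoothed word log-likelihoods."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in zip(segments, labels):
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    model = {}
    for label in class_counts:
        total = sum(word_counts[label].values())
        log_prior = math.log(class_counts[label] / len(labels))
        log_lik = {w: math.log((word_counts[label][w] + alpha) /
                               (total + alpha * len(vocab)))
                   for w in vocab}
        oov = math.log(alpha / (total + alpha * len(vocab)))
        model[label] = (log_prior, log_lik, oov)
    return model

def predict_mnb(model, text):
    """Return the class with the highest posterior log-probability."""
    scores = {}
    for label, (log_prior, log_lik, oov) in model.items():
        scores[label] = log_prior + sum(
            log_lik.get(w, oov) for w in text.lower().split())
    return max(scores, key=scores.get)

segments = [
    "we share your information with third parties",
    "third parties may collect your data",
    "we use ssl encryption to secure your data",
    "your data is stored on secure servers",
]
labels = ["Third Party Sharing/Collection", "Third Party Sharing/Collection",
          "Data Security", "Data Security"]
model = train_mnb(segments, labels)
print(predict_mnb(model, "information shared with third parties"))
```

In practice one would use a library implementation with n-gram features; the point here is only the shape of the baseline.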
Performance (Precision/Recall/F1)
Simple techniques are not easy to beat.
Picking a technique purely based on the F1 metric may be simplistic.
Could use different techniques for different categories – or just LR for all.
Performance is strongly impacted by the number of instances available for training.
Precision: How often am I correct? Recall: What percentage of instances do I catch?
F1: combines both.
Shomir Wilson, Florian Schaub, Frederick Liu, Kanthashree Mysore Sathyendra, Daniel Smullen, Sebastian Zimmeck, Rohan
Ramanath, Peter Story, Fei Liu, Norman Sadeh, Noah A. Smith, "Analyzing Privacy Policies at Scale: From Crowdsourcing
to Automated Annotations", ACM Transactions on the Web, 13, 1, Dec 2018 [pdf]
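The three metrics defined above can be computed directly; the label and prediction vectors below are made-up examples, where 1 means a segment was tagged with the category of interest.

```python
# Minimal precision/recall/F1 computation for binary labels.
def precision_recall_f1(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0  # how often predictions are correct
    recall = tp / (tp + fn) if tp + fn else 0.0     # fraction of true instances caught
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)           # harmonic mean of the two
    return precision, recall, f1

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]
p, r, f = precision_recall_f1(y_true, y_pred)
print(round(p, 2), round(r, 2), round(f, 2))  # 0.75 0.75 0.75
```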
Another Task: User Choice Instance Extraction
• User choices often buried
deep in the text of long policies
• Is it possible to automatically
extract information about
such “choice instances” from
privacy policies?
• Use Natural Language Toolkit
tokenizer to subdivide
segments into sentences &
build classifiers
Example choice instance:
“If you do not want us to use
personal information that we
gather to allow third parties to
personalize advertisements
we display to you, please
adjust your Advertising
Preferences.”
K.M. Sathyendra, S. Wilson, F. Schaub, S. Zimmeck, N. Sadeh. Identifying the Provision of Choices in Privacy Policies, EMNLP Conference,
2017
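The pipeline described above can be sketched as follows: split a policy segment into sentences, then flag candidate "choice instances". The paper trains classifiers for this step; here a small keyword rule stands in for the classifier, a simple regex stands in for the NLTK sentence tokenizer, and the cue phrases are illustrative, not the paper's feature set.

```python
# Sketch: sentence-split a policy segment, then flag sentences that look
# like "choice instances". Cue phrases are illustrative stand-ins for a
# trained classifier; the regex splitter stands in for the NLTK tokenizer.
import re

CHOICE_CUES = re.compile(
    r"\b(opt[- ]?out|opt[- ]?in|unsubscribe|if you do not want|"
    r"adjust your .* preferences|you may choose)\b", re.IGNORECASE)

def find_choice_instances(segment):
    sentences = re.split(r"(?<=[.!?])\s+", segment)
    return [s for s in sentences if CHOICE_CUES.search(s)]

segment = ("We display personalized advertisements. "
           "If you do not want us to use personal information that we gather "
           "to allow third parties to personalize advertisements, please "
           "adjust your Advertising Preferences. "
           "We retain server logs for 30 days.")
for sentence in find_choice_instances(segment):
    print(sentence)
```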
Privacy Choices
• A number of
choices:
– De-activate account
– Delete account
– Opt in to some practice
(e.g. collection of
location)
– Opt-out (e.g., email,
cookies, etc.)
Privacy Choices in OPP-115 Corpus
Initial Results (opt-out links)
Better results since then: Precision of 0.93 and recall of 0.98
Annotated 7,000+ policies
https://explore.usableprivacy.org/
Word Embeddings - I
• Models learnt from raw text are brittle – fail to
exploit similarities between words
• ”A word is characterized by the company it keeps”
– John Firth – collocational meaning
• Distributional Semantics: capture semantic
similarities between words based on distributional
properties in large corpora of documents
• Different contexts – e.g. Apple the fruit vs. Apple the
company; or browser cookie vs. chocolate chip
cookie
Vinayshekhar Bannihatti Kumar, Abhilasha Ravichander, Peter Story, and Norman Sadeh, "Quantifying the Effect of In-
Domain Distributed Word Representations: A Study of Privacy Policies", AAAI Spring Symposium on Privacy Enhancing
AI and Language Technologies (PAL 2019), Mar 2019 [pdf]
Word Embeddings - II
• Two styles of word embeddings: word represented
as either:
– Vectors of co-occurring words
– Vectors of linguistic contexts
• These days, these models are typically trained using
neural nets – rather than traditional n-gram models.
– Unsupervised learning
• Do domain-specific embeddings help improve
performance over generic embeddings?
– GloVe: generic embeddings
– Word2Vec privacy-specific embeddings we trained –
using 150,000 mobile app privacy policies (Google
Play Store)
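The distributional idea above ("a word is characterized by the company it keeps") can be demonstrated in a few lines: represent each word by a vector of co-occurrence counts with its neighbors and compare words by cosine similarity. A real system, like the Word2Vec embeddings the slide mentions, learns dense vectors from a large corpus; this tiny corpus is made up.

```python
# Toy distributional semantics: co-occurrence vectors + cosine similarity.
# Words that appear in the same contexts end up with similar vectors.
import math
from collections import Counter, defaultdict

def cooccurrence_vectors(sentences, window=2):
    """Count, for each word, which words appear within +/- `window` of it."""
    vectors = defaultdict(Counter)
    for sent in sentences:
        words = sent.lower().split()
        for i, w in enumerate(words):
            for j in range(max(0, i - window), min(len(words), i + window + 1)):
                if i != j:
                    vectors[w][words[j]] += 1
    return vectors

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = lambda x: math.sqrt(sum(c * c for c in x.values()))
    return dot / (norm(u) * norm(v))

corpus = [
    "we collect your location data",
    "we collect your usage data",
    "third parties receive location data",
    "third parties receive usage data",
]
vecs = cooccurrence_vectors(corpus)
# "location" and "usage" share contexts, so they are more similar to each
# other than "location" is to "parties".
print(cosine(vecs["location"], vecs["usage"]) >
      cosine(vecs["location"], vecs["parties"]))
```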
OPP-115: Number of Instances
Performance Improvement Using
Domain-Specific Embeddings
• Comparison between generic GloVe embeddings and
Word2Vec embeddings trained on large corpus of
privacy policies
• Significant improvements in categories with small
number of instances (OPP-115)
Data Practice Category   Relative gain in F1 (Dev set)
Data Security            9%
Policy Change            6%
Data Retention           10%
Visualizing Word Embeddings
Note also the different scales!
How many privacy policies do we
need?
Varies based on data practice – finer-grained data practices might benefit from a larger corpus
Deep Contextualized Word
Embeddings (“BERT”)
[Devlin et al. 2018]
• BERT is a transformer model trained
on the BooksCorpus and Wikipedia.
• BERT produces contextualized word
embeddings: the same word in
different contexts gets different
embeddings.
• BERT achieves state-of-the-art results
on many classification tasks.
BERT vs. in-domain word embeddings
Data Practice (F1 values for BERT vs. In-Domain Embedding shown in the original slide):
– 1st Party Collection & Use
– 3rd Party Sharing & Collection
– User Choice/Control
– Data Security
– International and Specific Audiences
– Access, Edit & Delete
– Policy Change
– Data Retention
– Do Not Track
Some further improvements, though domain-specific embeddings continue to
provide value for practices with small number of instances
Mobile App Privacy Compliance
• Millions of apps in Google Play and iOS/iTunes app
stores
• These apps access a large number of sensitive APIs
(e.g., location, calendar, camera)
• Most developers lack the resources and know-how to
ensure that their apps are compliant
• Regulations are changing (e.g. GDPR, CCPA)
• Manually vetting apps does not scale: In 2014, the
Global Privacy Enforcement Network assessed 1,211
apps in one week
Can We Automatically Check for Potential
Compliance Issues?
• Training machine learning classifiers to extract relevant policy statements
• Compare these statements against:
– Regulatory requirements
– What the software actually does
• Static and dynamic code analysis
S. Zimmeck, Z. Wang, L. Zou, R. Iyengar, B. Liu, F. Schaub, S. Wilson, N. Sadeh, S.M. Bellovin, J.R. Reidenberg, "Automated
Analysis of Privacy Requirements for Mobile Apps", NDSS'17: Network and Distributed System Security Symposium, Feb
2017 [pdf]
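The comparison step described above reduces, at its core, to set logic: practices the policy discloses (from the NLP classifiers) versus practices the app is observed performing (from static/dynamic code analysis). The practice names and sets below are made-up examples; the real pipeline extracts both sets automatically.

```python
# Sketch of the policy-vs-code comparison step. A practice performed in
# code but absent from the policy is a potential compliance issue; a
# disclosed-but-unobserved practice merely means the policy over-discloses.
def potential_compliance_issues(disclosed, observed):
    return sorted(observed - disclosed)

disclosed = {"1st_party_location", "3rd_party_identifier"}   # from policy text
observed = {"1st_party_location", "3rd_party_identifier",     # from code analysis
            "3rd_party_location"}
print(potential_compliance_issues(disclosed, observed))  # ['3rd_party_location']
```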
Performance metrics of our
classifiers’ ability to determine
whether a privacy policy
states that a practice is
performed, calculated using
the held-out test set (n = 100).
+ number of ground-truth instances
(cases where policies truly describe the
practice being performed)
- number of negative ground-truth
instances (cases where policies truly
don’t describe the practice being
performed).
NPV stands for negative predictive
value, or the precision for negative
instances.
Specificity is the recall for negative
instances. Negative F1 is the F1 for
negative instances.
Performance metrics
of our static
analysis’s ability to
detect the practices
in our app test set
(n = 100). Dynamic
analysis used as
ground truth
(conservative).
+ number of positive
ground truth instances
- number of negative
ground truth instances
? number of instances
where either the
ground truth is
unknown or the manual
analysis failed.
Scaling our Analysis
• We analyzed as many free apps as we
could find on the US Google Play Store
• Successfully analyzed 1,035,853 free
apps
Peter Story, Sebastian Zimmeck, Abhilasha Ravichander, Daniel Smullen, Ziqi Wang, Joel Reidenberg, N. Cameron Russell,
and Norman Sadeh, "Natural Language Processing for Mobile App Privacy Compliance", AAAI Spring Symposium on
Privacy Enhancing AI and Language Technologies (PAL 2019), Mar 2019 [pdf]
Findings: Number of Potential
Compliance Issues
• Average number of
potential compliance
issues per app is 3.47
and the median is 3
• Note: “No Policies”
apps are excluded from
the following figures
Findings: Prevalence of Practices and
Potential Compliance Issues
• Location and certain Identifiers
are quite common
• 3rd Party practices are more common
than 1st Party
Findings: Play Store Categories
• Lighter colors indicate greater transparency of practices. Darker colors
indicate that practices are being performed but not disclosed.
• Cells with fewer than 25 apps performing the practice are annotated with
the respective number of apps.
• Third party libraries: transparency issues
• Greater transparency for FAMILY categories, but still not great.
Applications
• COPPA report compiled for FTC
• Focusing on location, apps with a large
number of downloads, and companies based
in the US
• Work with large European electronics
manufacturer – checking for GDPR
compliance of mobile apps
Reports View with Search and Filters
Search and filter according to:
- Practices (e.g., GPS location collection)
- Type (Is GPS location not mentioned in the policy, or does the policy explicitly state it is not occurring?)
- Specificity (Perhaps “location” suffices in some jurisdictions not specifically requiring mention of “GPS.”)
- Many more …
Flashlight App Metadata Details
All available app metadata is
extracted from the Google Play
Store
Flashlight App Policy Results
Relevant parts of the policies are
extracted and displayed
alongside the analysis results
Flashlight App Static Analysis Results
Relevant parts of the
code are extracted
and displayed
alongside the
analysis results
Answering People’s Privacy Questions
• Generating summaries of
privacy policies has significant
limitations
• Different people are interested
in different sets of issues
• How about developing
functionality capable of
answering people’s privacy
questions?
”One Size
Fits All”
does not
work
Abhilasha Ravichander, Alan Black, Eduard Hovy, Joel Reidenberg, N. Cameron Russell, and Norman Sadeh, "Challenges
in Automated Question Answering for Privacy Policies", AAAI Spring Symposium on Privacy Enhancing AI and
Language Technologies (PAL 2019), Mar 2019 [pdf]
Research Questions
• What types of questions do people have?
• How do people formulate their privacy questions?
– Are they able to articulate their questions?
• Do their questions make sense?
• Are their questions ambiguous?
• Do their questions require clarification?
• What types of answers do people find useful?
– Content, length, assumptions about what people
understand/know
• Can questions be answered based on the text of
privacy policies?
– If not, what other sources of information are available?
Collecting a Corpus of Privacy Questions
Crowdsourcing privacy questions about mobile apps
Scenario: Asked to assume there is a privacy assistant capable of answering
their privacy questions
Crowdsourcing Task – Amazon Mechanical Turk
Legal Analysis
Data Collected
• 10 mobile app categories – app categories with
more than 2% of the total number of apps
• Total of 27 mobile apps – 50% with over 5M
downloads and 50% with fewer downloads
• 50 questions per policy – 5 crowdworkers for
each app and 10 questions per crowdworker:
total of 1,350 questions
• Avg. question length: 8.4 words
• Avg. policy length: 3,273 words
• Avg. answer length: 104.5 words
Types of Questions – Word Analysis
Relevance and Subjectivity
• Subjective & privacy related: 4.9% – “Is my data safe?”
• Subjective & not privacy related: 1.4% – “Are there any in-game purchases I should be concerned about?”
• Objective & privacy related: 74% – “What information are they collecting?”
• Objective & not privacy related: 19.7% – “How much does it cost?”
Not hopeless… but not trivial either.
Data Practice Categories
Relatively well aligned with distribution of practice statements in privacy policies…
What Makes Questions Unanswerable? (I)
• 56% of unanswerable questions would typically not be
addressed in the text of privacy policies
• “How does the currency within the game work?” ---- out of
scope
• “Has Viber had data breaches in the past?” ---- not typically
disclosed in a privacy policy….would have to look for
other sources of information (e.g. news sites, social media,
etc.)
• 24% of unanswerable questions should ideally be
answerable based on the text of the privacy policy but policy
was silent (e.g., “Is my data encrypted?”)
– Public policy implications
– Could potentially fall back on background knowledge
• 6% of unanswerable questions are too
vague to be correctly interpreted
– “Who can contact me through the app?”
– Would benefit from dialogue to clarify what the
person is asking
• 4% are ambiguously phrased
– “Any difficulty to occupy the privacy
assistant?”
– Dialogue could help too
What Makes Questions Unanswerable? (II)
• 3% of unanswerable questions are too
specific to typically be addressed in
privacy policy
– “Does it have access to financial apps I use?”
– Could potentially fall back on general
knowledge
• 7% are subjective
– “How do I know this app is legit?”
– Could potentially fall back on general
knowledge
What Makes Questions Unanswerable? (III)
Observations
• Privacy Q&A could be more effective than one-size-fits-
all summaries
• Privacy policies are underspecified
– Would benefit from accessing other sources of
information
• Some questions can automatically be answered
• Some questions will require dialogues with users to
disambiguate what the user is interested in
Ongoing Work on Privacy Q&A
• Building classifiers to determine if a question is
answerable and, if it is not, why (e.g.
differentiating between subjective questions,
questions that require disambiguation, questions
that show lack of knowledge, and questions that
would benefit from other sources of
information)
• Building answer templates – incl.
qualifications/disclaimers, background
knowledge, etc.
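A hypothetical sketch of the triage classifier described above: route each question either to "answerable" or to a reason it cannot be answered from the policy text. Real systems train classifiers on annotated questions; the rules, cue words, and category names below are purely illustrative.

```python
# Illustrative rule-based triage for privacy questions. A trained
# classifier would replace these hand-written patterns.
import re

RULES = [
    ("subjective", re.compile(r"\b(safe|legit|trust)\b", re.I)),
    ("out_of_scope", re.compile(r"\b(cost|price|currency|purchase)\b", re.I)),
    ("too_vague", re.compile(r"^\s*who can\b", re.I)),
]

def triage(question):
    for label, pattern in RULES:
        if pattern.search(question):
            return label
    return "answerable"

print(triage("How do I know this app is legit?"))        # subjective
print(triage("How much does it cost?"))                  # out_of_scope
print(triage("Who can contact me through the app?"))     # too_vague
print(triage("What information are they collecting?"))   # answerable
```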
Concluding Remarks - I
• Data-centric economy: Impossible for
people to keep up with all the different
ways in which their data is collected & used
• Privacy policies are required by a number
of regulations around the world, but people
can’t be expected to read them
• Advances in NLP/ML make it possible to
develop solutions that automatically
extract statements from the text of
policies
Concluding Remarks - II
• Applications
– Tools to help people more effectively navigate
the text of privacy policies
– Tools to automatically extract privacy
choices buried in the text of policies (e.g.,
opt-outs)
– Tools to help regulators, app stores and app
developers with privacy compliance issues
– Tools to help answer those privacy
questions a given individual cares about
Q&A
Acknowledgements: Work funded by the National
Science Foundation, DARPA and Google
The Usable Privacy Policy Project and the
Personalized Privacy Assistant Project both involve
collaborations with a number of individuals.
See usableprivacy.org and privacyassistant.org for
additional details incl. lists of collaborators and
publications
Hong Kong Personal Data (Privacy) Ordinance (Dec. 1996)
Six Principles:
1. Purpose & Manner of Collection: has to be disclosed to the data subject
2. Accuracy and Duration of Retention of Personal Data: data should be up to date and only retained as long as necessary
3. Use of Personal Data: only for the purpose for which data was collected – unless otherwise agreed by the data subject
4. Security of Personal Data: protection against unauthorized or accidental access, processing or deletion
5. Notification: open policies about data being collected & for what purpose
6. Access to Personal Data: right to review and correct data about oneself
Hong Kong Personal Data (Privacy) Ordinance
• Personal data can only be used for the purpose for which it was collected – no frivolous collection
– This also restricts sharing
• Purpose has to be stated from the beginning
• People should have the right to inspect information held about them within 40 days of their asking
– May involve a fee
• Data has to be corrected if erroneous
• Data has to be secure
• No direct marketing or teleselling if someone opts out
• Individuals can sue if damage results from the release of confidential data, or from inaccurate data or other breach
• Note: This is a very approximate summary – read the text of the Ordinance for a more detailed & accurate understanding
Recent Amendments to HK Personal Data Ordinance
• Effective April 1, 2013
• Imposes additional requirements for data users that seek to:
– Sell personal data
– Use personal data for their own direct marketing purposes
– Provide personal data to another person for that other person’s direct marketing purposes
More Details
EU – GDPR
• Took effect May 25, 2018
• Stricter provisions than EU Data Protection Directive
– Single set of rules and one-stop shop model: each company coordinates with a single Supervisory Authority
– Privacy by Design and by Default – incl. default privacy settings that are protective
– Opt-in: data controllers must be able to prove consent & consent may be withdrawn
– Severe penalties for violations: up to €20M or 4% of worldwide annual turnover, whichever is greater
– Right to “erasure”
– Data portability