Multimedia Privacy
Gerald Friedland, Symeon Papadopoulos, Julia Bernd, Yiannis Kompatsiaris
ACM Multimedia, Amsterdam, October 16, 2016
What’s the Big Deal?
Overview of Tutorial
• Part I: Understanding the Problem
• Part II: User Perceptions About Privacy
• Part III: Multimodal Inferences
• Part IV: Some Possible Solutions
• Part V: Future Directions
Part I: Understanding the Problem
What Can a Mindreader Read?
• These vulnerabilities arise with any type of public or semi-public post; they are not specific to a particular type of information, e.g. text, image, or video.
• However, let's focus on multimedia data: images, audio, video, social media context, etc.
Multimedia on the Internet Is Big!
Source: Domosphere
Resulting Problem
• More multimedia data = higher demand for retrieval and organization tools.
• But multimedia retrieval is hard!
• Researchers work on making retrieval better (cf. latest advances in deep learning for content-based retrieval).
• Industry develops workarounds to make retrieval easier right away.
Hypothesis
• Retrieval is already good enough to cause major issues for privacy that are not easy to solve.
• Let's take a look at some retrieval approaches:
  • Image tagging
  • Geo-tagging
  • Multimodal location estimation
  • Audio-based user matching
Workaround: Manual Tagging
Workaround: Geo-Tagging
Source: Wikipedia
Geo-Tagging
Allows easier clustering of photo and video series, among other things.
Geo-Tagging Everywhere
Part of the location-based service hype:
But: Geo-coordinates + Time = Unique ID!
Support for Geo-Tags
• Social media portals provide APIs to connect geo-tags with metadata, accounts, and web content.
• Allows easy search, retrieval, and ad placement.
Portal  | %*  | Total
YouTube | 3.0 | 3M
Flickr  | 4.5 | 180M
*estimate (2013)
Hypothesis
• Since geo-tagging is a workaround for multimedia retrieval, it allows us to peek into a future where multimedia retrieval works perfectly.
• What if multimedia retrieval actually just worked?
Related Work
“Be careful when using social location sharing services, such as Foursquare.”
Related Work
Mayhemic Labs, June 2010: “Are you aware that Tweets are geo-tagged?”
Can you do real harm?
• Cybercasing: using online (location-based) data and services to enable physical-world crimes.
• Three case studies:
G. Friedland and R. Sommer: "Cybercasing the Joint: On the Privacy Implications of Geotagging", Proceedings of the Fifth USENIX Workshop on Hot Topics in Security (HotSec 10), Washington, D.C, August 2010.
Case Study 1: Twitter
• Pictures in Tweets can be geo-tagged.
• From a tech-savvy celebrity we found:
  • Home location (several pics)
  • Where the kids go to school
  • Where he/she walks the dog
  • "Secret" office
Celebs Unaware of Geo-Tagging
Source: ABC News
Celebs Unaware of Geotagging
Google Maps Shows Address...
Case Study 2: Craigslist
"For Sale" section of Bay Area Craigslist.com:
• 4 days: 68,729 pictures total, 1.3% geo-tagged
Users Are Unaware of Geo-Tagging
• Many "anonymized" ads had geo-location.
• Some were selling high-value goods, e.g. cars, diamonds, etc.
• Some said "call Sunday after 6pm".
• Multiple photos allow interpolation of coordinates for higher accuracy.
Craigslist: Real Example
Geo-Tagging Resolution
Measured accuracy: +/- 1m
iPhone 3G picture Google Street View
What About Inference?
Owner
Valuable
Case Study 3: YouTube
Recall:
• Once data is published, the Internet keeps it (often with many copies).
• APIs are easy to use and allow quick retrieval of large amounts of data.
Can we find people on vacation using YouTube?
Cybercasing on YouTube
Experiment: cybercasing using the YouTube API (240 lines of Python)
Cybercasing on YouTube
Input parameters:
• Location: 37.869885, -122.270539
• Radius: 100 km
• Keywords: kids
• Distance: 1000 km
• Time frame: this_week
Cybercasing on YouTube
Output:
• Initial videos: 1000 (max_res)
• User hull: ~50k videos
• Vacation hits: 106
• Cybercasing targets: >12
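The core filtering step of an experiment like this can be sketched in a few lines. This is a hypothetical reconstruction, not the original 240-line script: the helper names and the toy video records are invented, and no real API is called. The idea is to flag users whose home area lies within the target radius but whose latest upload is far from home (i.e. likely on vacation).

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 6371 * 2 * asin(sqrt(a))

def cybercasing_candidates(videos, target, radius_km, away_km):
    """Users whose (estimated) home is near the target area but whose
    most recent geo-tagged video was shot far from home."""
    hits = []
    for v in videos:
        near_home = haversine_km(v["home_lat"], v["home_lon"],
                                 target[0], target[1]) <= radius_km
        far_away = haversine_km(v["lat"], v["lon"],
                                v["home_lat"], v["home_lon"]) >= away_km
        if near_home and far_away:
            hits.append(v["user"])
    return hits

# Toy data: user "a" posts from Paris while living near Berkeley.
videos = [
    {"user": "a", "home_lat": 37.87, "home_lon": -122.27,
     "lat": 48.85, "lon": 2.35},     # ~8900 km from home
    {"user": "b", "home_lat": 37.87, "home_lon": -122.27,
     "lat": 37.88, "lon": -122.26},  # still at home
]
print(cybercasing_candidates(videos, (37.869885, -122.270539),
                             radius_km=100, away_km=1000))  # ['a']
```

The same parameters as on the slide (100 km radius around the target, 1000 km "away" distance) drive the two distance tests.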
The Threat Is Real!
Question
Do you think geo-tagging should be illegal?
a) No, people just have to be more careful. The possibilities still outweigh the risks.
b) Maybe it should be regulated somehow to make sure no harm can be done.
c) Yes, absolutely! This information is too dangerous.
But…
Is this really about geo-tags?
(remember: hypothesis)
But…
Is this really about geo-tags?
No, it’s about the privacy implications of multimedia retrieval in general.
Question
And now? What do you think should be done?
a) Nothing can be done. Privacy is dead.
b) I will think before I post, but I don't know that it matters.
c) We need to educate people about this and try to save privacy. (Fight!)
d) I'll never post anything ever again! (Flight!)
Observations
• Many applications encourage heavy data sharing, and users go along with it.
• Multimedia isn't only a lot of data; it's also a lot of implicit information.
• Both users and engineers are often unaware of the hidden retrieval possibilities of shared (multimedia) data.
• Local anonymization and privacy policies may be ineffective against cross-site inference.
Dilemma
• People will continue to want social networks and location-based services.
• Industry and research will continue to improve retrieval techniques.
• Governments will continue to conduct surveillance and intelligence-gathering.
Solutions That Don't Work
• "I blur the faces."
  • Audio and image artifacts can still give you away.
• "I only share with my friends."
  • But who are they sharing with, on what platforms?
• "I don't do social networking."
  • Others may do it for you!
Further Observations
• There is not much incentive to worry about privacy, until things go wrong.
• People’s perception of the Internet does not match reality (enough).
Basics: Definitions and Background
Definition
• Privacy is the right to be let alone (Justices Warren and Brandeis)
• Privacy is:
  a) the quality or state of being apart from company or observation
  b) freedom from unauthorized intrusion
  (Merriam-Webster's)
Starting Points
• Privacy is a human right. Every individual has a need to keep something about themselves private.
• Companies have a need for privacy.
• Governments have a need for privacy (currently heavily discussed).
A Taxonomy of Social Networking Data
• Service data: data you give to an OSN to use it, e.g. name, birthday, etc.
• Disclosed data: what you post on your own page/space
• Entrusted data: what you post on other people's pages, e.g. comments
• Incidental data: what other people post about you
• Behavioural data: data the site collects about you
• Derived data: data a third party infers about you based on all the other data
B. Schneier. A Taxonomy of Social Networking Data, Security & Privacy, IEEE, vol.8, no.4, pp.88, July-Aug. 2010
Privacy Bill of Rights
In February 2012, the US Government released
CONSUMER DATA PRIVACY IN A NETWORKED WORLD:A FRAMEWORK FOR PROTECTING PRIVACY AND PROMOTING
INNOVATION IN THE GLOBAL DIGITAL ECONOMY
http://www.whitehouse.gov/sites/default/files/privacy-final.pdf
Privacy Bill of Rights
1) Individual Control: Consumers have a right to exercise control over what personal data is collected from them and how it is used.
2) Transparency: Consumers have a right to easily understandable and accessible information about privacy and security practices.
3) Respect for Context: Consumers have a right to expect that organizations will collect, use, and disclose personal data in ways consistent with the context in which consumers provide the data.
4) Security: Consumers have a right to secure and responsible handling of personal data.
5) Access and Accuracy: Consumers have a right to access and correct personal data in usable formats, in a manner that is appropriate to the sensitivity of the data and the risk of adverse consequences to citizens if the data is inaccurate.
6) Focused Collection: Consumers have a right to reasonable limits on the personal data that organizations collect and retain.
7) Accountability: Consumers have a right to have personal data handled by organizations with appropriate measures in place to assure they adhere to the Consumer Privacy Bill of Rights.
One View
The Privacy Bill of Rights could serve as a requirements framework for an ideally privacy-aware Internet service.
...if it were adopted.
Limitations
• The Privacy Bill of Rights is subject to interpretation.
  • What is "reasonable"?
  • What is "context"?
  • What is "personal data"?
• The Privacy Bill of Rights presents technical challenges.
Personal Data Protection in EU• The Data Protection Directive* (aka Directive 95/46/EC on
the protection of individuals with regard to the processing of personal data and on the free movement of such data) is an EU directive adopted in 1995 which regulates the processing of personal data within the EU. It is an important component of EU privacy and human rights law.
• The General Data Protection Regulation, in progress since 2012 and adopted in April 2016, will supersede the Data Protection Directive and become enforceable as of 25 May 2018.
• Objectives• Give control of personal data to citizens• Simplify regulatory environment for businesses
* A directive is a legal act of the European Union, which requires member states to achieve a particular result without dictating the means of achieving that result.
When is it legitimate…?
Collecting and processing the personal data of individuals is only legitimate in one of the following circumstances (Article 7 of the Directive):
• The individual gives unambiguous consent
• Processing is needed for a contract (e.g. an electricity bill)
• Processing is required by a legal obligation
• Processing is necessary to protect the vital interests of the person (e.g. processing medical data of an accident victim)
• Processing is necessary to perform tasks of public interest
• The data controller or a third party has a legitimate interest in doing so, as long as this does not affect the interests of the data subject or infringe his/her fundamental rights
Obligations of data controllers in EU
Data controllers must respect the following rules:
• Personal data must be collected and used for explicit and legitimate purposes.
• It must be adequate, relevant, and not excessive in relation to those purposes.
• It must be accurate and updated when needed.
• Data subjects must be able to correct, remove, etc. incorrect data about themselves (access).
• Personal data should not be kept longer than necessary.
• Data controllers must protect personal data (incl. from unauthorized access by third parties) using appropriate protection measures (security, accountability).
Handling sensitive data
Definition of sensitive data in the EU:
• religious beliefs
• political opinions
• health
• sexual orientation
• race
• trade union membership
Processing sensitive data comes under a stricter set of rules (Article 8).
Enforcing data protection in EU?
• The Directive states that every EU country must provide one or more independent supervisory authorities to monitor its application.
• In principle, all data controllers must notify their supervisory authorities when they process personal data.
• The national authorities are also in charge of receiving and handling complaints from individuals.
Data Protection: US vs EU
• The US has no legislation comparable to the EU's Data Protection Directive.
• US privacy legislation is adopted on an ad hoc basis, e.g. when certain sectors and circumstances require it (HIPAA, CTPCA, FCRA).
• The US adopts a more laissez-faire approach.
• In general, US privacy legislation is considered "weaker" than the EU's.
Example: What Is Sensitive Data?
Public records indicate you own a house.
Example: What Is Sensitive Data?
A geo-tagged photo taken by a friend reveals who attended your party!
Example: What Is Sensitive Data?
Facial recognition match with a public record: Prior arrest for drug offense!
Example: What Is Sensitive Data?
1) Public records indicate you own a house.
2) A geo-tagged photo taken by a friend reveals who attended your party.
3) Facial recognition match with a public record: prior arrest for a drug offense!
→ “You associate with convicts”
Example: What Is Sensitive Data?
“You associate with convicts”
What will this do for your reputation when you:• Date?• Apply for a job?• Want to be elected to public office?
Example: What Is Sensitive Data?
But: Which of these is the sensitive data?
a) Public record: You own a house
b) Geo-tagged photo taken by a friend at your party
c) Public record: A friend's prior arrest for a drug offense
d) Conclusion: "You associate with convicts."
e) None of the above.
Who Is to Blame?
a) The government, for its Open Data policy?
b) Your friend who posted the photo?
c) The person who inferred data from publicly available information?
Part II: User Perceptions About Privacy
Study 1: Users’ Understandings of Privacy
The Teaching Privacy Project
• Goal: Create a privacy curriculum for K-12 and undergrad, with lesson plans, teaching tools, visualizations, etc.
• NSF-sponsored (CNS-1065240 and DGE-1419319; all conclusions ours).
• Check It Out: Info, public education, and teaching resources: http://teachingprivacy.org
Based on Several Research Strands
• Joint work between Friedland, Bernd, Serge Egelman, Dan Garcia, Blanca Gordo, and many others!
• Understanding of user perceptions comes from:
  • Decades of research comparing privacy comprehension, preferences, concerns, and behaviors, including by Egelman and colleagues at CMU
  • Research on new Internet users' privacy perceptions, including Gordo's evaluations of digital-literacy programs
  • Observation of multimedia privacy leaks, e.g. the "cybercasing" study
  • Reports from high school and undergraduate teachers about students' misperceptions
  • Summer programs for high schoolers interested in CS
Common Research Threads
• What happens on the Internet affects the "real" world.
• However: group pressure, impulse, convenience, and other factors usually dominate decision making.
• Aggravated by a lack of understanding of how sharing on the Internet really works.
• Wide variation in both comprehension and actual preferences.
Multimedia Motivation
• Many current multimedia R&D applications have a high potential to compromise the privacy of Internet users.
• We want to continue pursuing fruitful and interesting research programs!
• But we can also work to mitigate negative effects by using our expertise to educate the public about effects on their privacy.
What Do People Need to Know?
Starting point: 10 observations about frequent misperceptions + 10 "privacy principles" to address them
Illustrations by Ketrina Yim.
Misconception #1
• Perception: I keep track of what I'm posting. I am in control. Websites are like rooms, and I know what's in each of them.
• Reality: Your information footprint is larger than you think!
  • An empty Twitter post has kilobytes of publicly available metadata.
  • Your footprint includes what others post about you, hidden data attached by services, records of your offline activities… not to mention inferences that can be drawn across all those "rooms"!
Misconception #2
• Perception: Surfing is anonymous. Lots of sites allow anonymous posting.
• Reality: There is no anonymity on the Internet.
  • Bits of your information footprint (geo-tags, language patterns, etc.) may make it possible for someone to uniquely identify you, even without a name.
Misconception #3
• Perception: There's nothing interesting about what I do online.
• Reality: Information about you on the Internet will be used by somebody in their interest, including against you.
  • Every piece of information has value to somebody: other people, companies, organizations, governments...
  • Using or selling your data is how Internet companies that provide "free" services make money.
Misconception #4
• Perception: Communication on the Internet is secure. Only the person I'm sending it to will see the data.
• Reality: Communication over a network, unless strongly encrypted, is never just between two parties.
  • Online data is always routed through intermediary computers and systems…
  • Which are connected to many more computers and systems...
Misconception #5
• Perception: If I make a mistake or say something dumb, I can delete it later. Anyway, people will get what I mean, right?
• Reality: Sharing information over a network means you give up control over that information, forever!
  • The Internet never forgets. Search engines, archives, and reposts duplicate data; you can't "unshare".
  • Websites sell your information, and data can be subpoenaed.
  • Anything shared online is open to misinterpretation. The Internet can't take a joke!
Misconception #6
• Perception: Facial recognition/speaker ID isn't good enough to find this. As long as no one can find it now, I'm safe.
• Reality: Just because it can't be found today doesn't mean it can't be found tomorrow.
  • Search engines get smarter.
  • Multimedia retrieval gets better.
  • Analog information gets digitized.
  • Laws, privacy settings, and privacy policies change.
Misconception #7
• Perception: What happens on the Internet stays on the Internet.
• Reality: The online world is inseparable from the “real” world.
• Your online activities are as much a part of your life as your offline activities.
• People don't separate what they know about Internet-you from what they know about in-person you.
Misconception #8
• Perception: I don't chat with strangers. I don't "friend" people on Facebook that I don't know.
• Reality: Are you sure? Identity isn't guaranteed on the Internet.
  • Most information that "establishes" identity in social networks may already be public.
  • There is no foolproof way to match a real person with their online identity.
Misconception #9
• Perception: I don't use the Internet. I am safe.
• Reality: You can't avoid having an information footprint by not going online.
  • Friends and family will post about you.
  • Businesses and government share data about you.
  • Companies track transactions online.
  • Smart cards transmit data online.
Misconception #10
• Perception: There are laws that keep companies and people from sharing my data. If a website has a privacy policy, that means they won't share my information. It's all good.
• Reality: Only you have an interest in maintaining your privacy!
  • Internet technology is rarely designed to protect privacy.
  • "Privacy policies" are there to protect providers from lawsuits.
  • Laws are spotty and vary from place to place.
  • Like it or not, your privacy is your own responsibility!
What Came of All This?
Example: "Ready or Not?" educational app
What Came of All This?
Example: "Digital Footprints" video
Study 2: Perceived vs. Actual Predictability of Personal Information in Social Nets
Papadopoulos and Kompatsiaris with Eleftherios Spyromitros-Xioufis, Giorgos Petkos, and Rob Heyman (iMinds)
Personal Information in OSNs
Participation in OSNs comes at a price!
• User-related data is shared with: a) other OSN users, b) the OSN itself, c) third parties (e.g. ad networks)
• Disclosure of specific types of data: e.g. gender, age, ethnicity, political or religious beliefs, sexual preferences, employment status, etc.
• Information isn't always explicitly disclosed!
  • Several types of personal information can be accurately inferred from implicit cues (e.g. Facebook likes) using machine learning! (cf. Part III)
Inferred Information & Privacy in OSNs
• Study of user awareness with regard to inferred information largely neglected by social research.
• Privacy usually presented as a question of giving access or communicating personal information to some party, e.g.:
“The claim of individuals, groups, or institutions to determine for themselves when, how, and to what extent information about them is communicated to others.” (Westin, 1970)
[1] Alan Westin. Privacy and freedom. Bodley Head, London, 1970.
Inferred Information & Privacy in OSNs
• However, access control is non-existent for inferred information:
• Users are unaware of the inferences being made.
• Users have no control over the way inferences are made.
• Goal: Investigate whether and how users intuitively grasp what can be inferred from their disclosed data!
Main Research Questions
1. Predictability: How predictable are different types of personal information, based on users' OSN data?
2. Actual vs. perceived predictability: How realistic are user perceptions about the predictability of their personal information?
3. Predictability vs. sensitivity: What is the relationship between perceived sensitivity and predictability of personal information?
• Previous work has focused mainly on Q1.
• We address Q1 using a variety of data and methods, and additionally address Q2 and Q3.
Data Collection
• Three types of data about 170 Facebook users:
  • OSN data: likes, posts, images, collected through a test Facebook application
  • Answers to questions about 96 personal attributes, organized into 9 categories, e.g. health factors, sexual orientation, income, political attitude, etc.
  • Answers to questions about their perceptions of the predictability and sensitivity of the 9 categories
http://databait.eu http://www.usemp-project.eu
Example From Questionnaire
• What is your sexual orientation? → ground truth
• Do you think the information on your Facebook profile reveals your sexual orientation, either because you yourself have put it online or because it could be inferred from a combination of posts? → perceived predictability
• How sensitive do you find the information you had to reveal about your sexual orientation? (1 = not sensitive at all, 7 = very sensitive) → perceived sensitivity

Responses (ground truth): heterosexual 147, homosexual 14, bisexual 7, n/a 2
Responses (perceived predictability): yes 134, no 33, n/a 3
Features Extracted From OSN Data
• likes: binary vector denoting presence/absence of a like (#3.6K)
• likesCats: histogram of like-category frequencies (#191)
• likesTerms: bag-of-words (BoW) of terms in the description, title, and about sections of likes (#62.5K)
• msgTerms: BoW vector of terms in user posts (#25K)
• lda-t: distribution of topics in the textual contents of both likes (description, title, and about sections) and posts
  • Latent Dirichlet Allocation with t = 20, 30, 50, 100
• visual: concepts depicted in user images (#11.9K), detected using a CNN, top 12 concepts per image, 3 variants:
  • visual-bin: hard 0/1 encoding
  • visual-freq: concept frequency histogram
  • visual-conf: sum of detection scores across all images
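As a toy illustration of the first two representations, here is how a binary likes vector and a like-category histogram might be built; the page names, categories, and helper names are invented for the example:

```python
def binary_likes_vector(user_likes, vocabulary):
    """likes: 0/1 vector marking presence of each page in a fixed vocabulary."""
    liked = set(user_likes)
    return [1 if page in liked else 0 for page in vocabulary]

def category_histogram(user_likes, page_to_category, categories):
    """likesCats: frequency of each like category."""
    counts = {c: 0 for c in categories}
    for page in user_likes:
        cat = page_to_category.get(page)
        if cat in counts:
            counts[cat] += 1
    return [counts[c] for c in categories]

vocab = ["PageA", "PageB", "PageC"]   # invented pages
cats = ["music", "sports"]
page_cat = {"PageA": "music", "PageB": "music", "PageC": "sports"}
likes = ["PageA", "PageC"]
print(binary_likes_vector(likes, vocab))          # [1, 0, 1]
print(category_histogram(likes, page_cat, cats))  # [1, 1]
```

In the real dataset the vocabulary has ~3.6K likes and 191 categories, but the construction is the same.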
Experimental Setup
• Evaluation method: repeated random sub-sampling
  • Data split randomly n = 10 times into train (67%) / test (33%)
  • Model fit on train; accuracy of inferences assessed on test
  • 96 questions (user attributes) were considered
• Evaluation measure: area under the ROC curve (AUC)
  • Appropriate for imbalanced classes
• Classification algorithms
  • Baseline: k-nearest neighbors, decision tree, naïve Bayes
  • SoA: AdaBoost, random forest, regularized logistic regression
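The evaluation protocol can be sketched as follows. The AUC implementation (Mann-Whitney form) and the trivial one-feature "classifier" are illustrative stand-ins, not the actual models used in the study:

```python
import random

def auc(labels, scores):
    """Area under the ROC curve via the Mann-Whitney U statistic."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return 0.5  # undefined for a single-class test set
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def repeated_subsampling_auc(data, fit, predict, n=10, train_frac=0.67, seed=0):
    """Average test AUC over n random 67/33 train/test splits."""
    rng = random.Random(seed)
    aucs = []
    for _ in range(n):
        shuffled = data[:]
        rng.shuffle(shuffled)
        cut = int(train_frac * len(shuffled))
        train, test = shuffled[:cut], shuffled[cut:]
        model = fit(train)
        aucs.append(auc([y for _, y in test],
                        [predict(model, x) for x, _ in test]))
    return sum(aucs) / len(aucs)

# Synthetic one-feature task: the feature is weakly shifted by the label.
rng = random.Random(1)
data = [(0.5 * y + rng.random(), y) for y in [0, 1] * 50]
score = repeated_subsampling_auc(data,
                                 fit=lambda train: None,
                                 predict=lambda model, x: x)
print(round(score, 2))  # well above the 0.5 chance level
```

In practice one would plug in the real classifiers (k-NN, random forest, regularized logistic regression) for `fit`/`predict`.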
Predictability per Attribute
(Chart: AUC per attribute; examples shown include nationality, is employed, can be moody, smokes cannabis, plays volleyball)
What Is More Predictable?

Rank | Perceived predictability | Actual predictability | Δ | SoA*
1 | Demographics | Demographics | - | Demographics
2 | Relationship status and living condition | Political views | +3 | Political views
3 | Sexual orientation | Sexual orientation | - | Religious views
4 | Consumer profile | Employment/Income | +4 | Sexual orientation
5 | Political views | Consumer profile | -1 | Health status
6 | Personality traits | Relationship status and living condition | -4 | Relationship status and living condition
7 | Religious views | Religious views | - |
8 | Employment/Income | Health status | +1 |
9 | Health status | Personality traits | -3 |

* Kosinski, et al. Private traits and attributes are predictable from digital records of human behavior. Proceedings of the National Academy of Sciences, 2013.
Predictability Versus Sensitivity
Part III: Multimodal Inferences
Personal Data: Truly Multimodal
• Text: posts, comments, content of articles you read/like, etc.
• Images/videos: posted by you, liked by you, posted by others but containing you
• Resources: likes, visited websites, groups, etc.
• Location: check-ins, GPS of posted images, etc.
• Network: what your friends look like, what they post, what they like, the community where you belong
• Sensors: wearables, fitness apps, IoT
What Can Be Inferred?
A lot…
Three Main Approaches
• Content-based: what you post is what/where/how/etc. you are
• Supervised learning: learn by example
• Network-based: show me your friends and I'll tell you who you are
Content-Based
Beware of your posts…
Location
Multimodal Location Estimation
Multimodal Location Estimation
We infer the location of a video from its visual stream, audio stream, and tags:
• Use geo-tagged data as training data.
• Allows faster search, inference, and intelligence-gathering, even without GPS.
G. Friedland, O. Vinyals, and T. Darrell: "Multimodal Location Estimation," pp. 1245-1251, ACM Multimedia, Florence, Italy, October 2010.
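A minimal sketch of the underlying idea, using tags only: estimate a query's location from the geo-tagged training item with the most tag overlap. This nearest-neighbour toy stands in for the actual system, which combines visual, audio, and tag evidence in a graphical model:

```python
def estimate_location(query_tags, training):
    """Toy tag-based location estimate: return the coordinates of the
    geo-tagged training item sharing the most tags with the query.
    training: list of (tag_set, (lat, lon)) pairs."""
    query = set(query_tags)
    best = max(training, key=lambda item: len(query & set(item[0])))
    return best[1]

# Invented geo-tagged "training" items.
training = [
    ({"berkeley", "campanile"}, (37.8721, -122.2578)),
    ({"paris", "eiffel"}, (48.8584, 2.2945)),
]
print(estimate_location({"campanile", "sunset"}, training))
# → the Berkeley coordinates, since "campanile" overlaps
```

The privacy point is that even a video with no GPS metadata can be localized this way, as long as its tags (or audio/visual content) correlate with geo-tagged training data.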
Intuition for the Approach{berkeley, sathergate, campanile}
{berkeley, haas}
{campanile} {campanile, haas}
Node: Geolocation of video
Edge: Correlated locations (e.g. common tag, visual, acoustic features)
Edge Potential: Strength of an edge (e.g. posterior distribution of locations given common tags)
MediaEval
J. Choi, G. Friedland, V. Ekambaram, K. Ramchandran: "Multimodal Location Estimation of Consumer Media: Dealing with Sparse Training Data," in Proceedings of IEEE ICME 2012, Melbourne, Australia, July 2012.
YouTube Cybercasing Revisited
YouTube Cybercasing With Geo-Tags vs. Multimodal Location Estimation
               | Old Experiment | No Geo-Tags
Initial videos | 1000 (max)     | 107
User hull      | ~50k           | ~2000
Potential hits | 106            | 112
Actual targets | >12            | >12
Account Linking
Can we link accounts based on their content?
Using Internet Videos: Dataset
Test videos from Flickr (~40 sec):
• 121 users to be matched, 50k trials
• 70% have heavy noise
• 50% speech
• 3% professional content
H. Lei, J. Choi, A. Janin, and G. Friedland: "Persona Linking: Matching Uploaders of Videos Across Accounts", IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Prague, May 2011.
Matching Users Within Flickr
Algorithm:
1) Take 10 seconds of the soundtrack of a video.
2) Extract the spectral envelope.
3) Compare using Manhattan distance.
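A toy version of these three steps, where a naive DFT plus a crude band-averaged magnitude spectrum stands in for the real acoustic front end:

```python
import cmath
from math import sin, pi

def dft_magnitudes(frame):
    """Naive DFT magnitudes for the first half of the spectrum."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * pi * k * t / n)
                    for t, x in enumerate(frame)))
            for k in range(n // 2)]

def spectral_envelope(frame, bands=8):
    """Crude envelope: mean magnitude in equal-width frequency bands."""
    mags = dft_magnitudes(frame)
    w = len(mags) // bands
    return [sum(mags[i * w:(i + 1) * w]) / w for i in range(bands)]

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

# Two "recordings" of the same low tone vs. a spectrally different one.
n = 64
same1 = [sin(2 * pi * 5 * t / n) for t in range(n)]
same2 = [0.9 * sin(2 * pi * 5 * t / n) for t in range(n)]
other = [sin(2 * pi * 20 * t / n) for t in range(n)]
d_same = manhattan(spectral_envelope(same1), spectral_envelope(same2))
d_other = manhattan(spectral_envelope(same1), spectral_envelope(other))
print(d_same < d_other)  # True: similar sources have closer envelopes
```

The real system operates on actual speech/audio features over 10-second windows; the point here is only the compare-envelopes-by-Manhattan-distance structure.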
Spectral Envelope
User ID on Flickr Videos
Persona Linking Using Internet Videos
Result:
• On average, having 40 seconds in both the test and training sets leads to a 99.2% chance of a true positive match!
Another Linkage Attack
Exploiting users' online activity to link accounts:
• Link based on where and when a user is posting
• Attack model is individual targeting
• Datasets: Yelp, Flickr, Twitter
• Methods:
  • Location profile
  • Timing profile
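The timing-profile idea can be sketched as a 24-bin histogram of posting hours compared with cosine similarity. The data below is invented, and the study's actual features and matching procedure may differ:

```python
from math import sqrt

def timing_profile(post_hours):
    """24-bin histogram of posting hours, normalized to sum to 1."""
    counts = [0] * 24
    for h in post_hours:
        counts[h % 24] += 1
    total = sum(counts) or 1
    return [c / total for c in counts]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical posting hours for one Yelp account and two Twitter candidates.
yelp = timing_profile([9, 9, 12, 13, 20, 21, 21])
twitter_same = timing_profile([9, 12, 12, 20, 21])   # similar daily rhythm
twitter_other = timing_profile([2, 3, 3, 4, 5])      # night owl
print(cosine(yelp, twitter_same) > cosine(yelp, twitter_other))  # True
```

A location profile works the same way, with spatial cells instead of hour bins; ranking candidates by profile similarity yields the de-anonymization candidate list.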
When a User Is Posting
Where a User Is Posting
- Twitter locations
- Yelp locations
De-Anonymization Model
Targeted account (Yelp users are ID'd) vs. a candidate list: how similar?
Datasets
• Three social networks: Yelp, Twitter, Flickr
• Two types of data sets:
  • Ground truth data sets:
    • Yelp-Twitter: 2,363 → 342 (with geotags) → 57 (in SF Bay)
    • Flickr-Twitter: 6,196 → 396 (with geotags) → 27 (in SF Bay)
  • Candidate Twitter list data set: 26,204
Performance on Matching
Supervised Learning
Learn by example
Inferring Personal Information
• Supervised learning algorithms
  • Learn a mapping (model) from inputs x_i to outputs y_i by analyzing a set of training examples D = {(x_i, y_i)}, i = 1…N
• In this case:
  • y_i corresponds to a personal user attribute, e.g. sexual orientation
  • x_i corresponds to a set of predictive attributes or features, e.g. user likes
• Some previous results:
  • Kosinski et al. [1]: likes features (SVD) + logistic regression: highly accurate inferences of ethnicity, gender, sexual orientation, etc.
  • Schwartz et al. [2]: status updates (PCA) + linear SVM: highly accurate inference of gender
[1] Kosinski, et al. Private traits and attributes are predictable from digital records of human behavior. Proceedings of the National Academy of Sciences, 2013.[2] Schwartz, et al. Personality, gender, and age in the language of social media: The open-vocabulary approach. PloS one, 2013.
What Do Your Likes Say About You?
M. Kosinski, D. Stillwell, T. Graepel. “Private Traits and Attributes are Predictable from Digital Records of Human Behavior”. PNAS 110: 5802 – 5805, 2013
Results: Prediction Accuracy
The More You Like…
Our Results: USEMP Dataset (Part II)
Testing different classifiers
Our Results: USEMP Dataset (Part II)
Testing different features
Our Results: USEMP Dataset (Part II)
Testing combinations of features
Caution: Reliability of Predictions
(Diagram: an ensemble of N models, each trained on a random α% sample of the training set)
Caution: Reliability of Predictions
Percentage of users for which individual models have low agreement (S_x < 0.5), and classification accuracy for those users. (MyPersonality dataset, subset)
Conclusions
• Representing users as feature vectors and using supervised learning can achieve pretty good accuracy in several cases.
However:
• There will be several cases where the output of the trained model is unreliable (close to random).
• For many classifiers and for abstract feature representations (e.g. SVD), it is very hard to explain why a particular user has been classified as belonging to a given class.
Network-Based Learning Show me your friends….
with Georgios Rizos
Network-Based Classification
• People with similar interests tend to connect → homophily
• Knowing about one's connections could reveal information about them
• Knowing about the whole network structure could reveal even more…
My Social Circles
A variety of affiliations:• Work• School• Family• Friends…
SoA: User Classification (1)
Graph-based semi-supervised learning:
• Label propagation (Zhu and Ghahramani, 2002)
• Local and global consistency (Zhou et al., 2004)
Other approaches to user classification:
• Hybrid feature engineering for inferring user behaviors (Pennacchiotti et al., 2011; Wagner et al., 2013)
• Crowdsourcing Twitter list keywords for popular users (Ghosh et al., 2012)
SoA: Graph Feature Extraction (2)
Use of community detection:
• EdgeCluster: edge-centric k-means (Tang and Liu, 2009)
• MROC: binary tree community hierarchy (Wang et al., 2013)
Low-rank matrix representation methods:
• Laplacian Eigenmaps: k eigenvectors of the graph Laplacian (Belkin and Niyogi, 2003; Tang and Liu, 2011)
• Random-Walk Modularity Maximization: does not suffer from the resolution limit of ModMax (Devooght et al., 2014)
• DeepWalk: deep representation learning (Perozzi et al., 2014)
Overview of Framework
Online social interactions (retweets, mentions, etc.) → social interaction user graph → ARCTE → supervised graph feature representation (with partial/sparse annotation) → feature weighting → user label learning → classified users
ARCTE: Intuition
Evaluation: Datasets
Ground truth generation:
• SNOW2014 Graph: Twitter list aggregation & post-processing
• IRMV-PoliticsUK: manual annotation
• ASU-YouTube: user membership in groups
• ASU-Flickr: user subscription to interest groups

Dataset | Labels | Vertices | Vertex Type | Edges | Edge Type
SNOW2014 Graph (Papadopoulos et al., 2014) | 90 | 533,874 | Twitter account | 949,661 | mentions + retweets
IRMV-PoliticsUK (Greene & Cunningham, 2013) | 5 | 419 | Twitter account | 11,349 | mentions + retweets
ASU-YouTube (Mislove et al., 2007) | 47 | 1,134,890 | YouTube channel | 2,987,624 | subscriptions
ASU-Flickr (Tang and Liu, 2009) | 195 | 80,513 | Flickr account | 5,899,882 | contacts
Example: Twitter
Twitter Handle | Labels
@nytimes | usa, press, new york
@HuffPostBiz | finance
@BBCBreaking | press, journalist, tv
@StKonrath | journalist
Examples from SNOW 2014 Data Challenge dataset
Evaluation: SNOW 2014 Dataset
SNOW2014 Graph (534K, 950K): Twitter mentions + retweets, ground truth based on Twitter list processing
Evaluation: ASU-YouTube
• ASU-YouTube (1.1M, 3M): YouTube subscriptions, ground truth based on membership in groups
Part IV: Some Possible Solutions
Solution 1: Disclosure Scoring Framework
with Georgios Petkos
Problem and Motivation
• Several studies have shown that privacy is a challenging issue in OSNs.
  • Madejski et al. asked 65 users to carefully examine their own profiles → every one of them identified a sharing violation.
• Information about a user may appear not only explicitly but also implicitly, and may therefore be inferred (think also of institutional privacy).
• Different users have different attitudes towards privacy and online information sharing (Knijnenburg, 2013).

Madejski et al., “A study of privacy setting errors in an online social network”. PERCOM, 2012
Knijnenburg, “Dimensionality of information disclosure behavior”. IJHCS, 2013
Disclosure Scoring
“A framework for quantifying the type of information one is sharing, and the extent of such disclosure.”
Requirements:
• Must take into account the fact that privacy concerns differ across users.
• Different types of information have different significance to users.
• Must take into account both explicit and inferred information.
Related Work
1. Privacy score [Liu10]: based on the concepts of visibility and sensitivity
2. Privacy Quotient and Leakage [Srivastava13]
3. Privacy Functionality Score [Ferrer10]
4. Privacy index [Nepali13]
5. Privacy Scores [Sramka15]
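The visibility-and-sensitivity idea of [Liu10] can be illustrated with a toy sketch (item names, sensitivity values, and visibility values below are invented for illustration; the actual paper estimates these from data):

```python
# Sketch of a [Liu10]-style privacy score: sum over profile items of the
# item's sensitivity times how visible the user has made it.
def privacy_score(sensitivity, visibility):
    """sensitivity[i]: how sensitive item i is (population-level);
    visibility[i]: how visible the user's value for item i is (0..1)."""
    return sum(s * v for s, v in zip(sensitivity, visibility))

# hypothetical items: [birthday, phone number, political views]
sensitivity = [0.3, 0.9, 0.7]
visibility  = [1.0, 0.0, 0.5]   # 1 = fully public, 0 = hidden
print(round(privacy_score(sensitivity, visibility), 2))  # 0.65
```

A hidden but highly sensitive item (the phone number here) contributes nothing, which is exactly why such scores must also account for *inferred* information, as the requirements above demand.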
Types of Personal Information
aka Disclosure Dimensions
Overview of PScore
[Pipeline: observed data (URLs, likes, posts) → inference algorithms → user attributes → disclosure dimensions]
Per user attribute: explicitly disclosed / inferred, value / predicted value, confidence of prediction, level of sensitivity
Per disclosure dimension: level of disclosure, reach of disclosure, level of sensitivity
Example
Visualization
Bubble color/size proportional to disclosure score → red/big corresponds to more sensitive/risky
Visualization
Hierarchical exploration of types of personal information.
http://usemp-mklab.iti.gr/usemp/
Solution 2: Personalized Privacy-Aware Image Classification
with Eleftherios Spyromitros-Xioufis and Adrian Popescu (CEA-LIST)
Privacy-Aware Image Classification
• Photo sharing may compromise privacy
• Can we make photo sharing safer?
  • Yes: build “private” image detectors
  • Alert whenever a “private” image is about to be shared
• Personalization is needed because privacy is subjective!
  - Would you share such an image?
  - Does it depend on whom you would share it with?
Previous Work, and Limitations
• Focus on a generic (“community”) notion of privacy
• Models trained on PicAlert [1]: Flickr images annotated according to a common privacy definition
• Consequences:
  • Variability in user perceptions not captured
  • Over-optimistic performance estimates
  • Justifications are barely comprehensible
[1] Zerr et al., I know what you did last summer!: Privacy-aware image classification and search, CIKM, 2012.
Goals of the Study
• Study personalization in image privacy classification
  • Compare personalized vs. generic models
  • Compare two types of personalized models
• Semantic visual features
  • Better justifications and privacy insights
• YourAlert: more realistic than existing benchmarks
Personalization Approaches
• Full personalization:
  • A different model for each user, relying only on their own feedback
  • Disadvantage: requires a lot of feedback
• Partial personalization:
  • Models rely on user feedback + feedback from other users
  • Amount of personalization controlled via instance weighting
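The instance-weighting idea behind partial personalization might be sketched as follows. Everything here is illustrative: the features and labels are synthetic, and LogisticRegression simply stands in for whatever classifier is used.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Partial personalization via instance weighting: generic examples get
# weight 1, the target user's own examples get a higher weight w, so the
# user's feedback dominates without discarding the generic training data.
rng = np.random.default_rng(0)
X_generic, y_generic = rng.normal(size=(200, 5)), rng.integers(0, 2, 200)
X_user,    y_user    = rng.normal(size=(20, 5)),  rng.integers(0, 2, 20)

w = 2.0  # personalization strength (cf. the 'hybrid w=2' setting later)
X = np.vstack([X_generic, X_user])
y = np.concatenate([y_generic, y_user])
weights = np.concatenate([np.ones(len(y_generic)),
                          np.full(len(y_user), w)])

model = LogisticRegression().fit(X, y, sample_weight=weights)
preds = model.predict(X_user)
```

Setting w = 1 treats user and generic examples equally; increasing w slides the model from generic towards fully personalized.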
Visual and Semantic Features
• vlad [1]: aggregation of local image descriptors
• cnn [2]: deep visual features
• semfeat [3]: outputs of ~17K concept detectors
  • Trained using cnn
  • Top 100 concepts kept per image

[1] Spyromitros-Xioufis et al., A comprehensive study over VLAD and product quantization in large-scale image retrieval. IEEE Transactions on Multimedia, 2014.
[2] Simonyan and Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv, 2014.
[3] Ginsca et al., Large-Scale Image Mining with Flickr Groups, MultiMedia Modeling, 2015.
Explanations via Semfeat
• Semfeat can be used to justify predictions
  • A tag cloud of the most discriminative visual concepts
• Explanations may often be confusing
  • Concept detectors are not perfect
  • The semfeat vocabulary (ImageNet) is not privacy-oriented
[Example tag cloud: knitwear, young-back, hand-glass, cigar-smoker, smoker, drinker, Freudian]
semfeat-LDA: Enhanced Explanations
• Project semfeat to a latent space (second-level semantic representation)
• Images treated as text documents (top 10 concepts)
• Text corpus created from private images (PicAlert + YourAlert)
• LDA is applied to create a topic model (30 topics)
• 6 privacy-related topics are identified (manually)
Topic     | Top 5 semfeat concepts assigned to each topic
children  | dribbler, child, godson, wimp, niece
drinking  | drinker, drunk, tipper, thinker, drunkard
erotic    | slattern, erotic, cover-girl, maillot, back
relatives | great-aunt, second-cousin, grandfather, mother, great-grandchild
vacations | seaside, vacationer, surf-casting, casting, sandbank
wedding   | groom, bride, celebrant, wedding, costume
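The semfeat-LDA construction (images as "documents" of their top concepts, topics learned with LDA) might be sketched like this. The concept documents are invented, and scikit-learn's LatentDirichletAllocation stands in for the topic model used in the study:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Each image becomes a "document" listing its top detected concepts;
# LDA then groups co-occurring concepts into latent topics.
image_docs = [
    "child godson niece child dribbler",
    "drinker drunk tipper drunkard drinker",
    "groom bride wedding celebrant costume",
    "seaside vacationer sandbank surf-casting seaside",
    "child niece godson wimp child",
]

# custom token pattern so hyphenated concepts stay intact
counts = CountVectorizer(token_pattern=r"[^\s]+").fit_transform(image_docs)
lda = LatentDirichletAllocation(n_components=3, random_state=0).fit(counts)
topic_dist = lda.transform(counts)   # 2nd-level semantic representation
print(topic_dist.shape)  # (5, 3)
```

Each row is a probability distribution over topics, which is the "second-level" representation used both for explanations and for the recurring-theme analysis later on.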
semfeat-LDA: Example
[1st-level semantic representation: tag cloud of detected concepts (knitwear, young-back, hand-glass, cigar-smoker, smoker, drinker, Freudian)]
[2nd-level semantic representation: distribution over the LDA topics]
YourAlert: A Realistic Benchmark
• User study
  • Participants annotate their own photos (informed consent; only extracted features are shared)
  • Annotation based on the following definitions:
    • Private: “would share only with close OSN friends or not at all”
    • Public: “would share with all OSN friends or even make public”
• Resulting dataset: YourAlert
  • 1.5K photos, 27 users, ~16 private / 40 public per user
  • Main advantages:
    • Facilitates realistic evaluation of privacy models
    • Allows development of personalized models
Publicly available at: http://mklab.iti.gr/datasets/image-privacy/
Generic Models: PicAlert vs. YourAlert
Key Findings
• Almost perfect performance on PicAlert with cnn
  • semfeat performs similarly to cnn
• Significantly worse performance on YourAlert
  • Similar performance for all features
• Additional findings
  • Using more generic training examples does not help
  • Large variability in performance across users
Personalized Privacy Models
• Evaluation carried out on YourAlert
  • A modified k-fold cross-validation for unbiased estimates
• Personalized model types
  • ‘user’: only user-specific examples from YourAlert
  • ‘hybrid’: a mixture of user-specific examples from YourAlert and generic examples from PicAlert
    • User-specific examples are weighted higher
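The per-user evaluation idea (cross-validation over each user's own photos) can be sketched as follows. The features and labels are synthetic, and scikit-learn's StratifiedKFold stands in for the paper's modified k-fold split:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# For one user: split their photos into 3 folds, train a 'user' model on
# two folds, score on the held-out fold, and average across folds.
rng = np.random.default_rng(2)
X_user = rng.normal(size=(56, 8))        # one user's photo features
y_user = np.array([1] * 16 + [0] * 40)   # ~16 private / 40 public

scores = []
splitter = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
for train_idx, test_idx in splitter.split(X_user, y_user):
    model = LogisticRegression().fit(X_user[train_idx], y_user[train_idx])
    probs = model.predict_proba(X_user[test_idx])[:, 1]
    scores.append(roc_auc_score(y_user[test_idx], probs))
print(len(scores))  # 3
```

A 'hybrid' variant would additionally append PicAlert examples to each training split, with the user-specific examples weighted higher as described above.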
Evaluation of Personalized Models
[Diagram, repeated per model type: for each user (u1, u2, u3), their YourAlert photos are split via 3-fold cross-validation, with k=1 fold held out as the test set; ‘hybrid’ models additionally train on generic PicAlert examples.]
Model types evaluated: ‘user’, ‘hybrid w=1’, ‘hybrid w=2’
Results
Privacy Insights via Semfeat
[Example discriminative concepts: private: child, mate, son; public: uphill, lakefront, waterside]
Identifying Recurring Privacy Themes
• A prototype semfeat-LDA vector for each user
  • The centroid of the semfeat-LDA vectors of their private images
• K-means (k=5) clustering on the prototype vectors
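A hedged sketch of this clustering step, with synthetic topic vectors standing in for the real semfeat-LDA representations:

```python
import numpy as np
from sklearn.cluster import KMeans

# Each user is represented by the centroid (prototype) of the semfeat-LDA
# topic vectors of their private images; users are then clustered with
# k-means to surface recurring privacy themes. All data here is synthetic.
rng = np.random.default_rng(1)
n_users, n_topics = 27, 6
user_image_vectors = [
    rng.dirichlet(np.ones(n_topics), size=int(rng.integers(5, 20)))
    for _ in range(n_users)
]

# prototype vector per user: mean of their private images' topic vectors
prototypes = np.array([v.mean(axis=0) for v in user_image_vectors])

clusters = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(prototypes)
print(clusters.shape)  # (27,)
```

Users landing in the same cluster share a dominant private topic (e.g. "children" or "drinking" from the topic table above).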
Would You Share the Following?
With whom would you share the photos in the following slides?
a) family
b) friends
c) colleagues
d) your Facebook friends
e) everyone (public)
Part V: Future Directions
Towards Private Multimedia Systems
We should:
• Research methods to help mitigate risks and offer choice.
• Develop privacy policies and APIs that take multimedia retrieval into account.
• Educate users and engineers on privacy issues.
...before panic slows progress in the multimedia field.
The Role of Research
Research can help:
• Describe and quantify risk factors
• Visualize and offer choices in UIs
• Identify privacy-breaking information
• Filter out “irrelevant information” through content analysis
Reality Check
Can we build a privacy-proof system?
No. We can’t build a theft-proof car either.
However, we can design a system to be more, or less, resistant to privacy attacks.
Emerging Issue: Internet of Things
Graphic by Applied Materials using International Data Corporation data.
Emerging Issue: Wearables
Source: Amish Gandhi via SlideShare
Multimedia Things
• Much of the data collected in the IoT is multimedia data.
  • Requires (exciting!) new approaches to real-time multimedia content analysis.
  • Presents new threats to security and privacy.
  • Requires new best practices for Security and Privacy by Design, and new privacy-enhancing technologies (PETs).
  • Presents opportunities to work on privacy enhancements to multimedia!
Example IoT Advice From the Future of Privacy Forum
• Get creative with using multimedia affordances (visual, audio, tactile) to alert users to data collection.
• Respect context: users may have different expectations for data they input manually and data collected by sensors.
• Inform users about how their data will be used.
• Choose de-identification practices according to your specific technical situation.
  • In fact, multimedia expertise can contribute to improving de-identification!
• Build trust by allowing users to engage with their own data, and to control who accesses it.
Source: Christopher Wolf, Jules Polonetsky, and Kelsey Finch, A Practical Privacy Paradigm for Wearables. Future of Privacy Forum, 2015.
One Privacy Design Practice Above All
Think about privacy (and security) as you BEGIN designing a system or planning a research program. Privacy is not an add-on!
Describing Risks: A Method from Security Research
• Build a model for potential attacks as a set of:
  • attacker properties
  • attack goals
• Proof your system against it as much as possible.
• Update users’ expectations about residual risk.
Attacker Properties: Individual Privacy
• Resources
  • individual / moderate / institutional resources
• Target model
  • targeted individual / easiest k of N / everyone
• Database access
  • full (private + public) data access / well-indexed access / poorly indexed access / hard retrieval / soft retrieval (multimedia)
Goals of Privacy Attacks
• Cybercasing (attack preparation)
• Cyberstalking
• Socio-economic profiling
• Espionage (industrial, state)
• Cybervetting
• Cyberframing
Towards Privacy-Proof MM Systems
• Match users’ expectations of privacy in system behavior (e.g. include user evaluation)
• If that’s not possible, educate users about the risks
• Ask yourself: what is the best trade-off for users between privacy, utility, and convenience?
• Don’t expose as much information as possible; expose only as much information as is required!
Engineering Rules From the Privacy Community
• Inform users of the privacy model and quantify the possible audience:
  • public / link-to-link / semi-public / private
  • How many people will see the information? (avg. friends-of-friends on Facebook: 70k people!)
• If users expect anonymity, explain the risks of exposure
  • self-posting of PII, hidden metadata, etc.
• Provide tools, based on expert knowledge, that make it easier to stay (more) anonymous (e.g. erase EXIF)
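One such tool, EXIF removal, might look like the following sketch using Pillow. Re-saving an image from its raw pixels drops the metadata block (GPS coordinates, camera model, timestamps); the function name and paths are illustrative:

```python
from PIL import Image

# Hedged sketch of an EXIF-stripping helper: rebuilding the image from its
# pixel data and saving the copy leaves the original EXIF block behind.
def strip_exif(src_path, dst_path):
    with Image.open(src_path) as img:
        clean = Image.new(img.mode, img.size)
        clean.putdata(list(img.getdata()))
        clean.save(dst_path)  # saved without the original EXIF metadata

# usage sketch (hypothetical filenames):
# strip_exif("holiday.jpg", "holiday_clean.jpg")
```

Note that re-encoding is lossy for JPEG; a production tool would rewrite the file's metadata segments instead, but the privacy effect is the same.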
Engineering Rules from the Privacy Community (2)
• Show users what metadata is collected by your service/app and to whom it is made available (AKA a “Privacy Nutrition Label”)
  • At the least, offer an opt-out!
• Make settings easily configurable (Facebook is not easily configurable)
• Offer methods to delete and correct data
  • If possible, trigger search-engine updating after deletion
  • If possible, offer “deep deletion” (i.e. delete re-posts, at least within-system)
Closing Thought Exercise: Part 1
Take two minutes to think about the following questions:
• What’s your area of expertise? What are you working on right now?
• How does it interact with privacy? What are the potential attacks and potential consequences?
• What can you do to mitigate negative privacy effects?
• What can you do to educate users about possible privacy implications?
Closing Thought Exercise: Part 2
• Turn to the person next to you and share your thoughts. Ask each other questions!
• You have five minutes.
Acknowledgments
Work together with:
• Jaeyoung Choi, Luke Gottlieb, Robin Sommer, Howard Lei, Adam Janin, Oana Goga, Nicholas Weaver, Dan Garcia, Blanca Gordo, Serge Egelman, and others
• Georgios Petkos, Eleftherios Spyromitros-Xioufis, Adrian Popescu, Rob Heyman, Georgios Rizos, Polychronis Charitidis, Thomas Theodoridis, and others
Thank You!
Acknowledgements:
• This material is based upon work supported by the US National Science Foundation under Grants No. CNS-1065240 and DGE-1419319, and by the European Commission under Grant No. 611596 for the USEMP project.
• Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the funding bodies.