Multimedia Privacy
Gerald Friedland, Symeon Papadopoulos, Julia Bernd, Yiannis Kompatsiaris
ACM Multimedia, Amsterdam, October 16, 2016
What’s the Big Deal?
Overview of Tutorial
• Part I: Understanding the Problem
• Part II: User Perceptions About Privacy
• Part III: Multimodal Inferences
• Part IV: Some Possible Solutions
• Part V: Future Directions
Part I: Understanding the Problem
What Can a Mindreader Read?
• These vulnerabilities arise with any type of public or semi-public post; they are not specific to a particular type of information, e.g. text, image, or video.
• However, let's focus on multimedia data: images, audio, video, social media context, etc.
Multimedia on the Internet Is Big!
Source: Domosphere
Resulting Problem
• More multimedia data = higher demand for retrieval and organization tools.
• But multimedia retrieval is hard!
• Researchers work on making retrieval better (cf. latest advances in deep learning for content-based retrieval).
• Industry develops workarounds to make retrieval easier right away.
Hypothesis
• Retrieval is already good enough to cause major issues for privacy that are not easy to solve.
• Let's take a look at some retrieval approaches:
  • Image tagging
  • Geo-tagging
  • Multimodal location estimation
  • Audio-based user matching
Workaround: Manual Tagging
Workaround: Geo-Tagging
Source: Wikipedia
Geo-Tagging
Allows easier clustering of photo and video series, among other things.
Geo-Tagging Everywhere
Part of the location-based service hype:
But: Geo-coordinates + Time = Unique ID!
Support for Geo-Tags
• Social media portals provide APIs to connect geo-tags with metadata, accounts, and web content.
• Allows easy search, retrieval, and ad placement.
Portal  | %*  | Total
YouTube | 3.0 | 3M
Flickr  | 4.5 | 180M
*estimate (2013)
Hypothesis
• Since geo-tagging is a workaround for multimedia retrieval, it allows us to peek into a future where multimedia retrieval works perfectly.
• What if multimedia retrieval actually just worked?
Related Work
“Be careful when using social location sharing services, such as Foursquare.”
Related Work
Mayhemic Labs, June 2010: “Are you aware that Tweets are geo-tagged?”
Can you do real harm?
• Cybercasing: using online (location-based) data and services to enable physical-world crimes.
• Three case studies:
G. Friedland and R. Sommer: "Cybercasing the Joint: On the Privacy Implications of Geotagging", Proceedings of the Fifth USENIX Workshop on Hot Topics in Security (HotSec 10), Washington, D.C, August 2010.
Case Study 1: Twitter
• Pictures in Tweets can be geo-tagged.
• From a tech-savvy celebrity we found:
  • Home location (several pics)
  • Where the kids go to school
  • Where he/she walks the dog
  • "Secret" office
Celebs Unaware of Geo-Tagging
Source: ABC News
Celebs Unaware of Geotagging
Google Maps Shows Address...
Case Study 2: Craigslist
"For Sale" section of Bay Area Craigslist.com:
• 4 days: 68,729 pictures total, 1.3% geo-tagged
Users Are Unaware of Geo-Tagging
• Many "anonymized" ads had geo-location.
• Some were selling high-value goods, e.g. cars, diamonds, etc.
• Some said "call Sunday after 6pm".
• Multiple photos allow interpolation of coordinates for higher accuracy.
Craigslist: Real Example
Geo-Tagging Resolution
Measured accuracy: +/- 1m
iPhone 3G picture Google Street View
What About Inference?
Owner
Valuable
Case Study 3: YouTube
Recall:
• Once data is published, the Internet keeps it (often with many copies).
• APIs are easy to use and allow quick retrieval of large amounts of data.
Can we find people on vacation using YouTube?
Cybercasing on YouTube
Experiment: cybercasing using the YouTube API (240 lines of Python)
Cybercasing on YouTube
Input parameters:
• Location: 37.869885, -122.270539
• Radius: 100 km
• Keywords: kids
• Distance: 1000 km
• Time frame: this_week
Cybercasing on YouTube
Output:
• Initial videos: 1000 (max_res)
• User hull: ~50k videos
• Vacation hits: 106
• Cybercasing targets: >12
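The core filtering step of an experiment like this can be sketched in a few lines. This is a hypothetical reconstruction, not the original 240-line script: the helper names and the toy video records are invented, and no real API is called. The idea is to flag users whose home area lies within the target radius but whose latest upload is far from home (i.e. likely on vacation).

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 6371 * 2 * asin(sqrt(a))

def cybercasing_candidates(videos, target, radius_km, away_km):
    """Users whose (estimated) home is near the target area but whose
    most recent geo-tagged video was shot far from home."""
    hits = []
    for v in videos:
        near_home = haversine_km(v["home_lat"], v["home_lon"],
                                 target[0], target[1]) <= radius_km
        far_away = haversine_km(v["lat"], v["lon"],
                                v["home_lat"], v["home_lon"]) >= away_km
        if near_home and far_away:
            hits.append(v["user"])
    return hits

# Toy data: user "a" posts from Paris while living near Berkeley.
videos = [
    {"user": "a", "home_lat": 37.87, "home_lon": -122.27,
     "lat": 48.85, "lon": 2.35},     # ~8900 km from home
    {"user": "b", "home_lat": 37.87, "home_lon": -122.27,
     "lat": 37.88, "lon": -122.26},  # still at home
]
print(cybercasing_candidates(videos, (37.869885, -122.270539),
                             radius_km=100, away_km=1000))  # ['a']
```

The same parameters as on the slide (100 km radius around the target, 1000 km "away" distance) drive the two distance tests.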
The Threat Is Real!
Question
Do you think geo-tagging should be illegal?
a) No, people just have to be more careful. The possibilities still outweigh the risks.
b) Maybe it should be regulated somehow to make sure no harm can be done.
c) Yes, absolutely! This information is too dangerous.
But…
Is this really about geo-tags?
(remember: hypothesis)
But…
Is this really about geo-tags?
No, it’s about the privacy implications of multimedia retrieval in general.
Question
And now? What do you think should be done?
a) Nothing can be done. Privacy is dead.
b) I will think before I post, but I don't know that it matters.
c) We need to educate people about this and try to save privacy. (Fight!)
d) I'll never post anything ever again! (Flight!)
Observations
• Many applications encourage heavy data sharing, and users go along with it.
• Multimedia isn't only a lot of data; it's also a lot of implicit information.
• Both users and engineers are often unaware of the hidden retrieval possibilities of shared (multimedia) data.
• Local anonymization and privacy policies may be ineffective against cross-site inference.
Dilemma
• People will continue to want social networks and location-based services.
• Industry and research will continue to improve retrieval techniques.
• Governments will continue to conduct surveillance and intelligence-gathering.
Solutions That Don't Work
• "I blur the faces."
  • Audio and image artifacts can still give you away.
• "I only share with my friends."
  • But who are they sharing with, on what platforms?
• "I don't do social networking."
  • Others may do it for you!
Further Observations
• There is not much incentive to worry about privacy, until things go wrong.
• People’s perception of the Internet does not match reality (enough).
Basics: Definitions and Background
Definition
• Privacy is the right to be let alone (Justices Warren and Brandeis)
• Privacy is:
  a) the quality or state of being apart from company or observation
  b) freedom from unauthorized intrusion
  (Merriam-Webster's)
Starting Points
• Privacy is a human right. Every individual has a need to keep something about themselves private.
• Companies have a need for privacy.
• Governments have a need for privacy (currently heavily discussed).
A Taxonomy of Social Networking Data
• Service data: data you give to an OSN to use it, e.g. name, birthday, etc.
• Disclosed data: what you post on your own page/space
• Entrusted data: what you post on other people's pages, e.g. comments
• Incidental data: what other people post about you
• Behavioural data: data the site collects about you
• Derived data: data a third party infers about you based on all the other data
B. Schneier. A Taxonomy of Social Networking Data, Security & Privacy, IEEE, vol.8, no.4, pp.88, July-Aug. 2010
Privacy Bill of Rights
In February 2012, the US Government released
CONSUMER DATA PRIVACY IN A NETWORKED WORLD:A FRAMEWORK FOR PROTECTING PRIVACY AND PROMOTING
INNOVATION IN THE GLOBAL DIGITAL ECONOMY
http://www.whitehouse.gov/sites/default/files/privacy-final.pdf
Privacy Bill of Rights
1) Individual Control: Consumers have a right to exercise control over what personal data is collected from them and how it is used.
2) Transparency: Consumers have a right to easily understandable and accessible information about privacy and security practices.
3) Respect for Context: Consumers have a right to expect that organizations will collect, use, and disclose personal data in ways consistent with the context in which consumers provide the data.
4) Security: Consumers have a right to secure and responsible handling of personal data.
5) Access and Accuracy: Consumers have a right to access and correct personal data in usable formats, in a manner that is appropriate to the sensitivity of the data and the risk of adverse consequences to citizens if the data is inaccurate.
6) Focused Collection: Consumers have a right to reasonable limits on the personal data that organizations collect and retain.
7) Accountability: Consumers have a right to have personal data handled by organizations with appropriate measures in place to assure they adhere to the Consumer Privacy Bill of Rights.
One View
The Privacy Bill of Rights could serve as a requirements framework for an ideally privacy-aware Internet service.
...if it were adopted.
Limitations
• The Privacy Bill of Rights is subject to interpretation.
  • What is "reasonable"?
  • What is "context"?
  • What is "personal data"?
• The Privacy Bill of Rights presents technical challenges.
Personal Data Protection in EU• The Data Protection Directive* (aka Directive 95/46/EC on
the protection of individuals with regard to the processing of personal data and on the free movement of such data) is an EU directive adopted in 1995 which regulates the processing of personal data within the EU. It is an important component of EU privacy and human rights law.
• The General Data Protection Regulation, in progress since 2012 and adopted in April 2016, will supersede the Data Protection Directive and become enforceable as of 25 May 2018.
• Objectives• Give control of personal data to citizens• Simplify regulatory environment for businesses
* A directive is a legal act of the European Union, which requires member states to achieve a particular result without dictating the means of achieving that result.
When is it legitimate…?
Collecting and processing the personal data of individuals is only legitimate in one of the following circumstances (Article 7 of the Directive):
• The individual gives unambiguous consent
• Processing is needed for a contract (e.g. an electricity bill)
• Processing is required by a legal obligation
• Processing is necessary to protect the vital interests of the person (e.g. processing medical data of an accident victim)
• Processing is necessary to perform tasks of public interest
• The data controller or a third party has a legitimate interest in doing so, as long as this does not affect the interests of the data subject or infringe his/her fundamental rights
Obligations of data controllers in EU
Data controllers must respect the following rules:
• Personal data must be collected and used for explicit and legitimate purposes.
• It must be adequate, relevant, and not excessive in relation to those purposes.
• It must be accurate and updated when needed.
• Data subjects must be able to correct, remove, etc. incorrect data about themselves (access).
• Personal data should not be kept longer than necessary.
• Data controllers must protect personal data (incl. from unauthorized access by third parties) using appropriate protection measures (security, accountability).
Handling sensitive data
Definition of sensitive data in the EU:
• religious beliefs
• political opinions
• health
• sexual orientation
• race
• trade union membership
Processing sensitive data comes under a stricter set of rules (Article 8).
Enforcing data protection in EU?
• The Directive states that every EU country must provide one or more independent supervisory authorities to monitor its application.
• In principle, all data controllers must notify their supervisory authorities when they process personal data.
• The national authorities are also in charge of receiving and handling complaints from individuals.
Data Protection: US vs EU
• The US has no legislation comparable to the EU's Data Protection Directive.
• US privacy legislation is adopted on an ad hoc basis, e.g. when certain sectors and circumstances require it (HIPAA, CTPCA, FCRA).
• The US adopts a more laissez-faire approach.
• In general, US privacy legislation is considered "weaker" than the EU's.
Example: What Is Sensitive Data?
Public records indicate you own a house.
Example: What Is Sensitive Data?
A geo-tagged photo taken by a friend reveals who attended your party!
Example: What Is Sensitive Data?
Facial recognition match with a public record: Prior arrest for drug offense!
Example: What Is Sensitive Data?
1) Public records indicate you own a house.
2) A geo-tagged photo taken by a friend reveals who attended your party.
3) Facial recognition match with a public record: prior arrest for a drug offense!
→ “You associate with convicts”
Example: What Is Sensitive Data?
“You associate with convicts”
What will this do for your reputation when you:• Date?• Apply for a job?• Want to be elected to public office?
Example: What Is Sensitive Data?
But: Which of these is the sensitive data?
a) Public record: You own a house
b) Geo-tagged photo taken by a friend at your party
c) Public record: A friend's prior arrest for a drug offense
d) Conclusion: "You associate with convicts."
e) None of the above.
Who Is to Blame?
a) The government, for its Open Data policy?
b) Your friend who posted the photo?
c) The person who inferred data from publicly available information?
Part II: User Perceptions About Privacy
Study 1: Users’ Understandings of Privacy
The Teaching Privacy Project
• Goal: Create a privacy curriculum for K-12 and undergrad, with lesson plans, teaching tools, visualizations, etc.
• NSF-sponsored (CNS-1065240 and DGE-1419319; all conclusions ours).
• Check It Out: Info, public education, and teaching resources: http://teachingprivacy.org
Based on Several Research Strands
• Joint work between Friedland, Bernd, Serge Egelman, Dan Garcia, Blanca Gordo, and many others!
• Understanding of user perceptions comes from:
  • Decades of research comparing privacy comprehension, preferences, concerns, and behaviors, including by Egelman and colleagues at CMU
  • Research on new Internet users' privacy perceptions, including Gordo's evaluations of digital-literacy programs
  • Observation of multimedia privacy leaks, e.g. the "cybercasing" study
  • Reports from high school and undergraduate teachers about students' misperceptions
  • Summer programs for high schoolers interested in CS
Common Research Threads
• What happens on the Internet affects the "real" world.
• However: group pressure, impulse, convenience, and other factors usually dominate decision making.
• Aggravated by a lack of understanding of how sharing on the Internet really works.
• Wide variation in both comprehension and actual preferences.
Multimedia Motivation
• Many current multimedia R&D applications have a high potential to compromise the privacy of Internet users.
• We want to continue pursuing fruitful and interesting research programs!
• But we can also work to mitigate negative effects by using our expertise to educate the public about effects on their privacy.
What Do People Need to Know?
Starting point: 10 observations about frequent misperceptions + 10 "privacy principles" to address them
Illustrations by Ketrina Yim.
Misconception #1
• Perception: I keep track of what I'm posting. I am in control. Websites are like rooms, and I know what's in each of them.
• Reality: Your information footprint is larger than you think!
  • An empty Twitter post has kilobytes of publicly available metadata.
  • Your footprint includes what others post about you, hidden data attached by services, records of your offline activities… not to mention inferences that can be drawn across all those "rooms"!
Misconception #2
• Perception: Surfing is anonymous. Lots of sites allow anonymous posting.
• Reality: There is no anonymity on the Internet.
  • Bits of your information footprint (geo-tags, language patterns, etc.) may make it possible for someone to uniquely identify you, even without a name.
Misconception #3
• Perception: There's nothing interesting about what I do online.
• Reality: Information about you on the Internet will be used by somebody in their interest, including against you.
  • Every piece of information has value to somebody: other people, companies, organizations, governments...
  • Using or selling your data is how Internet companies that provide "free" services make money.
Misconception #4
• Perception: Communication on the Internet is secure. Only the person I'm sending it to will see the data.
• Reality: Communication over a network, unless strongly encrypted, is never just between two parties.
  • Online data is always routed through intermediary computers and systems…
  • Which are connected to many more computers and systems...
Misconception #5
• Perception: If I make a mistake or say something dumb, I can delete it later. Anyway, people will get what I mean, right?
• Reality: Sharing information over a network means you give up control over that information, forever!
  • The Internet never forgets. Search engines, archives, and reposts duplicate data; you can't "unshare".
  • Websites sell your information, and data can be subpoenaed.
  • Anything shared online is open to misinterpretation. The Internet can't take a joke!
Misconception #6
• Perception: Facial recognition/speaker ID isn't good enough to find this. As long as no one can find it now, I'm safe.
• Reality: Just because it can't be found today doesn't mean it can't be found tomorrow.
  • Search engines get smarter.
  • Multimedia retrieval gets better.
  • Analog information gets digitized.
  • Laws, privacy settings, and privacy policies change.
Misconception #7
• Perception: What happens on the Internet stays on the Internet.
• Reality: The online world is inseparable from the “real” world.
• Your online activities are as much a part of your life as your offline activities.
• People don't separate what they know about Internet-you from what they know about in-person you.
Misconception #8
• Perception: I don't chat with strangers. I don't "friend" people on Facebook that I don't know.
• Reality: Are you sure? Identity isn't guaranteed on the Internet.
  • Most information that "establishes" identity in social networks may already be public.
  • There is no foolproof way to match a real person with their online identity.
Misconception #9
• Perception: I don't use the Internet. I am safe.
• Reality: You can't avoid having an information footprint by not going online.
  • Friends and family will post about you.
  • Businesses and government share data about you.
  • Companies track transactions online.
  • Smart cards transmit data online.
Misconception #10
• Perception: There are laws that keep companies and people from sharing my data. If a website has a privacy policy, that means they won't share my information. It's all good.
• Reality: Only you have an interest in maintaining your privacy!
  • Internet technology is rarely designed to protect privacy.
  • "Privacy policies" are there to protect providers from lawsuits.
  • Laws are spotty and vary from place to place.
  • Like it or not, your privacy is your own responsibility!
What Came of All This?
Example: "Ready or Not?" educational app
What Came of All This?
Example: "Digital Footprints" video
Study 2: Perceived vs. Actual Predictability of Personal Information in Social Nets
Papadopoulos and Kompatsiaris with Eleftherios Spyromitros-Xioufis, Giorgos Petkos, and Rob Heyman (iMinds)
Personal Information in OSNs
Participation in OSNs comes at a price!
• User-related data is shared with: a) other OSN users, b) the OSN itself, c) third parties (e.g. ad networks)
• Disclosure of specific types of data: e.g. gender, age, ethnicity, political or religious beliefs, sexual preferences, employment status, etc.
• Information isn't always explicitly disclosed!
  • Several types of personal information can be accurately inferred from implicit cues (e.g. Facebook likes) using machine learning! (cf. Part III)
Inferred Information & Privacy in OSNs
• Study of user awareness with regard to inferred information largely neglected by social research.
• Privacy usually presented as a question of giving access or communicating personal information to some party, e.g.:
“The claim of individuals, groups, or institutions to determine for themselves when, how, and to what extent information about them is communicated to others.” (Westin, 1970)
[1] Alan Westin. Privacy and freedom. Bodley Head, London, 1970.
Inferred Information & Privacy in OSNs
• However, access control is non-existent for inferred information:
• Users are unaware of the inferences being made.
• Users have no control over the way inferences are made.
• Goal: Investigate whether and how users intuitively grasp what can be inferred from their disclosed data!
Main Research Questions
1. Predictability: How predictable are different types of personal information, based on users' OSN data?
2. Actual vs. perceived predictability: How realistic are user perceptions about the predictability of their personal information?
3. Predictability vs. sensitivity: What is the relationship between perceived sensitivity and predictability of personal information?
• Previous work has focused mainly on Q1.
• We address Q1 using a variety of data and methods, and additionally address Q2 and Q3.
Data Collection
• Three types of data about 170 Facebook users:
  • OSN data: likes, posts, images, collected through a test Facebook application
  • Answers to questions about 96 personal attributes, organized into 9 categories, e.g. health factors, sexual orientation, income, political attitude, etc.
  • Answers to questions about their perceptions of the predictability and sensitivity of the 9 categories
http://databait.eu http://www.usemp-project.eu
Example From Questionnaire
• What is your sexual orientation? → ground truth
• Do you think the information on your Facebook profile reveals your sexual orientation, either because you yourself have put it online or because it could be inferred from a combination of posts? → perceived predictability
• How sensitive do you find the information you had to reveal about your sexual orientation? (1 = not sensitive at all, 7 = very sensitive) → perceived sensitivity

Responses (ground truth): heterosexual 147, homosexual 14, bisexual 7, n/a 2
Responses (perceived predictability): yes 134, no 33, n/a 3
Features Extracted From OSN Data
• likes: binary vector denoting presence/absence of a like (#3.6K)
• likesCats: histogram of like-category frequencies (#191)
• likesTerms: bag-of-words (BoW) of terms in the description, title, and about sections of likes (#62.5K)
• msgTerms: BoW vector of terms in user posts (#25K)
• lda-t: distribution of topics in the textual contents of both likes (description, title, and about sections) and posts
  • Latent Dirichlet Allocation with t = 20, 30, 50, 100
• visual: concepts depicted in user images (#11.9K), detected using a CNN, top 12 concepts per image, 3 variants:
  • visual-bin: hard 0/1 encoding
  • visual-freq: concept frequency histogram
  • visual-conf: sum of detection scores across all images
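As a toy illustration of the first two representations, here is how a binary likes vector and a like-category histogram might be built; the page names, categories, and helper names are invented for the example:

```python
def binary_likes_vector(user_likes, vocabulary):
    """likes: 0/1 vector marking presence of each page in a fixed vocabulary."""
    liked = set(user_likes)
    return [1 if page in liked else 0 for page in vocabulary]

def category_histogram(user_likes, page_to_category, categories):
    """likesCats: frequency of each like category."""
    counts = {c: 0 for c in categories}
    for page in user_likes:
        cat = page_to_category.get(page)
        if cat in counts:
            counts[cat] += 1
    return [counts[c] for c in categories]

vocab = ["PageA", "PageB", "PageC"]   # invented pages
cats = ["music", "sports"]
page_cat = {"PageA": "music", "PageB": "music", "PageC": "sports"}
likes = ["PageA", "PageC"]
print(binary_likes_vector(likes, vocab))          # [1, 0, 1]
print(category_histogram(likes, page_cat, cats))  # [1, 1]
```

In the real dataset the vocabulary has ~3.6K likes and 191 categories, but the construction is the same.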
Experimental Setup
• Evaluation method: repeated random sub-sampling
  • Data split randomly n = 10 times into train (67%) / test (33%)
  • Model fit on train; accuracy of inferences assessed on test
  • 96 questions (user attributes) were considered
• Evaluation measure: area under the ROC curve (AUC)
  • Appropriate for imbalanced classes
• Classification algorithms
  • Baseline: k-nearest neighbors, decision tree, naïve Bayes
  • SoA: AdaBoost, random forest, regularized logistic regression
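The evaluation protocol can be sketched as follows. The AUC implementation (Mann-Whitney form) and the trivial one-feature "classifier" are illustrative stand-ins, not the actual models used in the study:

```python
import random

def auc(labels, scores):
    """Area under the ROC curve via the Mann-Whitney U statistic."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return 0.5  # undefined for a single-class test set
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def repeated_subsampling_auc(data, fit, predict, n=10, train_frac=0.67, seed=0):
    """Average test AUC over n random 67/33 train/test splits."""
    rng = random.Random(seed)
    aucs = []
    for _ in range(n):
        shuffled = data[:]
        rng.shuffle(shuffled)
        cut = int(train_frac * len(shuffled))
        train, test = shuffled[:cut], shuffled[cut:]
        model = fit(train)
        aucs.append(auc([y for _, y in test],
                        [predict(model, x) for x, _ in test]))
    return sum(aucs) / len(aucs)

# Synthetic one-feature task: the feature is weakly shifted by the label.
rng = random.Random(1)
data = [(0.5 * y + rng.random(), y) for y in [0, 1] * 50]
score = repeated_subsampling_auc(data,
                                 fit=lambda train: None,
                                 predict=lambda model, x: x)
print(round(score, 2))  # well above the 0.5 chance level
```

In practice one would plug in the real classifiers (k-NN, random forest, regularized logistic regression) for `fit`/`predict`.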
Predictability per Attribute
(Chart: AUC per attribute; examples shown include nationality, is employed, can be moody, smokes cannabis, plays volleyball)
What Is More Predictable?

Rank | Perceived predictability | Actual predictability | Δ | SoA*
1 | Demographics | Demographics | - | Demographics
2 | Relationship status and living condition | Political views | +3 | Political views
3 | Sexual orientation | Sexual orientation | - | Religious views
4 | Consumer profile | Employment/Income | +4 | Sexual orientation
5 | Political views | Consumer profile | -1 | Health status
6 | Personality traits | Relationship status and living condition | -4 | Relationship status and living condition
7 | Religious views | Religious views | - |
8 | Employment/Income | Health status | +1 |
9 | Health status | Personality traits | -3 |

* Kosinski, et al. Private traits and attributes are predictable from digital records of human behavior. Proceedings of the National Academy of Sciences, 2013.
Predictability Versus Sensitivity
Part III: Multimodal Inferences
Personal Data: Truly Multimodal
• Text: posts, comments, content of articles you read/like, etc.
• Images/videos: posted by you, liked by you, posted by others but containing you
• Resources: likes, visited websites, groups, etc.
• Location: check-ins, GPS of posted images, etc.
• Network: what your friends look like, what they post, what they like, the community where you belong
• Sensors: wearables, fitness apps, IoT
What Can Be Inferred?
A lot…
Three Main Approaches
• Content-based: what you post is what/where/how/etc. you are
• Supervised learning: learn by example
• Network-based: show me your friends and I'll tell you who you are
Content-Based
Beware of your posts…
Location
Multimodal Location Estimation
Multimodal Location Estimation
We infer the location of a video from its visual stream, audio stream, and tags:
• Use geo-tagged data as training data.
• Allows faster search, inference, and intelligence-gathering, even without GPS.
G. Friedland, O. Vinyals, and T. Darrell: "Multimodal Location Estimation," pp. 1245-1251, ACM Multimedia, Florence, Italy, October 2010.
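A minimal sketch of the underlying idea, using tags only: estimate a query's location from the geo-tagged training item with the most tag overlap. This nearest-neighbour toy stands in for the actual system, which combines visual, audio, and tag evidence in a graphical model:

```python
def estimate_location(query_tags, training):
    """Toy tag-based location estimate: return the coordinates of the
    geo-tagged training item sharing the most tags with the query.
    training: list of (tag_set, (lat, lon)) pairs."""
    query = set(query_tags)
    best = max(training, key=lambda item: len(query & set(item[0])))
    return best[1]

# Invented geo-tagged "training" items.
training = [
    ({"berkeley", "campanile"}, (37.8721, -122.2578)),
    ({"paris", "eiffel"}, (48.8584, 2.2945)),
]
print(estimate_location({"campanile", "sunset"}, training))
# → the Berkeley coordinates, since "campanile" overlaps
```

The privacy point is that even a video with no GPS metadata can be localized this way, as long as its tags (or audio/visual content) correlate with geo-tagged training data.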
Intuition for the Approach{berkeley, sathergate, campanile}
{berkeley, haas}
{campanile} {campanile, haas}
Node: Geolocation of video
Edge: Correlated locations (e.g. common tag, visual, acoustic features)
Edge Potential: Strength of an edge (e.g. posterior distribution of locations given common tags)
MediaEval
J. Choi, G. Friedland, V. Ekambaram, K. Ramchandran: "Multimodal Location Estimation of Consumer Media: Dealing with Sparse Training Data," in Proceedings of IEEE ICME 2012, Melbourne, Australia, July 2012.
YouTube Cybercasing Revisited
YouTube Cybercasing With Geo-Tags vs. Multimodal Location Estimation
               | Old Experiment | No Geo-Tags
Initial videos | 1000 (max)     | 107
User hull      | ~50k           | ~2000
Potential hits | 106            | 112
Actual targets | >12            | >12
Account Linking
Can we link accounts based on their content?
Using Internet Videos: Dataset
Test videos from Flickr (~40 sec):
• 121 users to be matched, 50k trials
• 70% have heavy noise
• 50% speech
• 3% professional content
H. Lei, J. Choi, A. Janin, and G. Friedland: "Persona Linking: Matching Uploaders of Videos Across Accounts", IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Prague, May 2011.
Matching Users Within Flickr
Algorithm:
1) Take 10 seconds of the soundtrack of a video.
2) Extract the spectral envelope.
3) Compare using Manhattan distance.
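A toy version of these three steps, where a naive DFT plus a crude band-averaged magnitude spectrum stands in for the real acoustic front end:

```python
import cmath
from math import sin, pi

def dft_magnitudes(frame):
    """Naive DFT magnitudes for the first half of the spectrum."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * pi * k * t / n)
                    for t, x in enumerate(frame)))
            for k in range(n // 2)]

def spectral_envelope(frame, bands=8):
    """Crude envelope: mean magnitude in equal-width frequency bands."""
    mags = dft_magnitudes(frame)
    w = len(mags) // bands
    return [sum(mags[i * w:(i + 1) * w]) / w for i in range(bands)]

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

# Two "recordings" of the same low tone vs. a spectrally different one.
n = 64
same1 = [sin(2 * pi * 5 * t / n) for t in range(n)]
same2 = [0.9 * sin(2 * pi * 5 * t / n) for t in range(n)]
other = [sin(2 * pi * 20 * t / n) for t in range(n)]
d_same = manhattan(spectral_envelope(same1), spectral_envelope(same2))
d_other = manhattan(spectral_envelope(same1), spectral_envelope(other))
print(d_same < d_other)  # True: similar sources have closer envelopes
```

The real system operates on actual speech/audio features over 10-second windows; the point here is only the compare-envelopes-by-Manhattan-distance structure.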
Spectral Envelope
User ID on Flickr Videos
Persona Linking Using Internet Videos
Result:
• On average, having 40 seconds in both the test and training sets leads to a 99.2% chance of a true positive match!
Another Linkage Attack
Exploiting users' online activity to link accounts:
• Link based on where and when a user is posting
• Attack model is individual targeting
• Datasets: Yelp, Flickr, Twitter
• Methods:
  • Location profile
  • Timing profile
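The timing-profile idea can be sketched as a 24-bin histogram of posting hours compared with cosine similarity. The data below is invented, and the study's actual features and matching procedure may differ:

```python
from math import sqrt

def timing_profile(post_hours):
    """24-bin histogram of posting hours, normalized to sum to 1."""
    counts = [0] * 24
    for h in post_hours:
        counts[h % 24] += 1
    total = sum(counts) or 1
    return [c / total for c in counts]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical posting hours for one Yelp account and two Twitter candidates.
yelp = timing_profile([9, 9, 12, 13, 20, 21, 21])
twitter_same = timing_profile([9, 12, 12, 20, 21])   # similar daily rhythm
twitter_other = timing_profile([2, 3, 3, 4, 5])      # night owl
print(cosine(yelp, twitter_same) > cosine(yelp, twitter_other))  # True
```

A location profile works the same way, with spatial cells instead of hour bins; ranking candidates by profile similarity yields the de-anonymization candidate list.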
When a User Is Posting
Where a User Is Posting
- Twitter locations
- Yelp locations
De-Anonymization Model
Targeted account (Yelp users are ID'd) vs. a candidate list: how similar?
Datasets
• Three social networks: Yelp, Twitter, Flickr
• Two types of data sets:
  • Ground truth data sets:
    • Yelp-Twitter: 2,363 → 342 (with geotags) → 57 (in SF Bay)
    • Flickr-Twitter: 6,196 → 396 (with geotags) → 27 (in SF Bay)
  • Candidate Twitter list data set: 26,204
Performance on Matching
Supervised Learning
Learn by example
Inferring Personal Information
• Supervised learning algorithms
  • Learn a mapping (model) from inputs x_i to outputs y_i by analyzing a set of training examples D = {(x_i, y_i)}, i = 1…N
• In this case:
  • y_i corresponds to a personal user attribute, e.g. sexual orientation
  • x_i corresponds to a set of predictive attributes or features, e.g. user likes
• Some previous results:
  • Kosinski et al. [1]: likes features (SVD) + logistic regression: highly accurate inferences of ethnicity, gender, sexual orientation, etc.
  • Schwartz et al. [2]: status updates (PCA) + linear SVM: highly accurate inference of gender
[1] Kosinski, et al. Private traits and attributes are predictable from digital records of human behavior. Proceedings of the National Academy of Sciences, 2013.[2] Schwartz, et al. Personality, gender, and age in the language of social media: The open-vocabulary approach. PloS one, 2013.
What Do Your Likes Say About You?
M. Kosinski, D. Stillwell, T. Graepel. “Private Traits and Attributes are Predictable from Digital Records of Human Behavior”. PNAS 110: 5802 – 5805, 2013
Results: Prediction Accuracy
The More You Like…
Our Results: USEMP Dataset (Part II)
Testing different classifiers
Our Results: USEMP Dataset (Part II)
Testing different features
Our Results: USEMP Dataset (Part II)
Testing combinations of features
Caution: Reliability of Predictions
(Diagram: an ensemble of N models, each trained on a random α% sample of the training set)
Caution: Reliability of Predictions
Percentage of users for which individual models have low agreement (S_x < 0.5), and classification accuracy for those users. (MyPersonality dataset, subset)
Conclusions
• Representing users as feature vectors and using supervised learning can achieve pretty good accuracy in several cases.
However:
• There will be several cases where the output of the trained model is unreliable (close to random).
• For many classifiers and for abstract feature representations (e.g. SVD), it is very hard to explain why a particular user has been classified as belonging to a given class.
Network-Based Learning Show me your friends….
with Georgios Rizos
Network-Based Classification
• People with similar interests tend to connect → homophily
• Knowing about one's connections could reveal information about them
• Knowing about the whole network structure could reveal even more…
My Social Circles
A variety of affiliations:• Work• School• Family• Friends…
SoA: User Classification (1)
Graph-based semi-supervised learning:
• Label propagation (Zhu and Ghahramani, 2002)
• Local and global consistency (Zhou et al., 2004)
Other approaches to user classification:
• Hybrid feature engineering for inferring user behaviors (Pennacchiotti et al., 2011; Wagner et al., 2013)
• Crowdsourcing Twitter list keywords for popular users (Ghosh et al., 2012)
SoA: Graph Feature Extraction (2)
Use of community detection:
• EdgeCluster: edge-centric k-means (Tang and Liu, 2009)
• MROC: binary tree community hierarchy (Wang et al., 2013)
Low-rank matrix representation methods:
• Laplacian Eigenmaps: k eigenvectors of the graph Laplacian (Belkin and Niyogi, 2003; Tang and Liu, 2011)
• Random-Walk Modularity Maximization: does not suffer from the resolution limit of ModMax (Devooght et al., 2014)
• DeepWalk: deep representation learning (Perozzi et al., 2014)
Overview of Framework
Online social interactions (retweets, mentions, etc.) → social interaction user graph → ARCTE → supervised graph feature representation (with partial/sparse annotation) → feature weighting → user label learning → classified users
ARCTE: Intuition
Evaluation: Datasets
Ground truth generation:
• SNOW2014 Graph: Twitter list aggregation & post-processing
• IRMV-PoliticsUK: manual annotation
• ASU-YouTube: user membership in groups
• ASU-Flickr: user subscription to interest groups

Dataset | Labels | Vertices | Vertex Type | Edges | Edge Type
SNOW2014 Graph (Papadopoulos et al., 2014) | 90 | 533,874 | Twitter account | 949,661 | mentions + retweets
IRMV-PoliticsUK (Greene & Cunningham, 2013) | 5 | 419 | Twitter account | 11,349 | mentions + retweets
ASU-YouTube (Mislove et al., 2007) | 47 | 1,134,890 | YouTube channel | 2,987,624 | subscriptions
ASU-Flickr (Tang and Liu, 2009) | 195 | 80,513 | Flickr account | 5,899,882 | contacts
Example: Twitter
Twitter Handle | Labels
@nytimes | usa, press, new york
@HuffPostBiz | finance
@BBCBreaking | press, journalist, tv
@StKonrath | journalist
Examples from SNOW 2014 Data Challenge dataset
Evaluation: SNOW 2014 Dataset
SNOW2014 Graph (534K, 950K): Twitter mentions + retweets, ground truth based on Twitter list processing
Evaluation: ASU-YouTube
• ASU-YouTube (1.1M, 3M): YouTube subscriptions, ground truth based on membership in groups
Part IV: Some Possible Solutions
Solution 1: Disclosure Scoring Framework
with Georgios Petkos
Problem and Motivation
• Several studies have shown that privacy is a challenging issue in OSNs.
  • Madejski et al. asked 65 users to carefully examine their own profiles → every one of them identified a sharing violation.
• Information about a user may appear not only explicitly but also implicitly, and may therefore be inferred (think also of institutional privacy).
• Different users have different attitudes towards privacy and online information sharing (Knijnenburg, 2013).

Madejski et al., “A study of privacy setting errors in an online social network”. PERCOM, 2012
Knijnenburg, “Dimensionality of information disclosure behavior”. IJHCS, 2013
Disclosure Scoring
“A framework for quantifying the type of information one is sharing, and the extent of such disclosure.”
Requirements:
• Must take into account the fact that privacy concerns differ across users.
• Different types of information have different significance to users.
• Must take into account both explicit and inferred information.
Related Work
1. Privacy score [Liu10]: based on the concepts of visibility and sensitivity
2. Privacy Quotient and Leakage [Srivastava13]
3. Privacy Functionality Score [Ferrer10]
4. Privacy index [Nepali13]
5. Privacy Scores [Sramka15]
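The visibility-and-sensitivity idea of [Liu10] can be illustrated with a toy sketch (item names, sensitivity values, and visibility values below are invented for illustration; the actual paper estimates these from data):

```python
# Sketch of a [Liu10]-style privacy score: sum over profile items of the
# item's sensitivity times how visible the user has made it.
def privacy_score(sensitivity, visibility):
    """sensitivity[i]: how sensitive item i is (population-level);
    visibility[i]: how visible the user's value for item i is (0..1)."""
    return sum(s * v for s, v in zip(sensitivity, visibility))

# hypothetical items: [birthday, phone number, political views]
sensitivity = [0.3, 0.9, 0.7]
visibility  = [1.0, 0.0, 0.5]   # 1 = fully public, 0 = hidden
print(round(privacy_score(sensitivity, visibility), 2))  # 0.65
```

A hidden but highly sensitive item (the phone number here) contributes nothing, which is exactly why such scores must also account for *inferred* information, as the requirements above demand.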
Types of Personal Information
aka Disclosure Dimensions
Overview of PScore
[Pipeline: observed data (URLs, likes, posts) → inference algorithms → user attributes → disclosure dimensions]
Per user attribute: explicitly disclosed / inferred, value / predicted value, confidence of prediction, level of sensitivity
Per disclosure dimension: level of disclosure, reach of disclosure, level of sensitivity
Example
Visualization
Bubble color/size proportional to disclosure score → red/big corresponds to more sensitive/risky
Visualization
Hierarchical exploration of types of personal information.
http://usemp-mklab.iti.gr/usemp/
Solution 2: Personalized Privacy-Aware Image Classification
with Eleftherios Spyromitros-Xioufis and Adrian Popescu (CEA-LIST)
Privacy-Aware Image Classification
• Photo sharing may compromise privacy
• Can we make photo sharing safer?
  • Yes: build “private” image detectors
  • Alert whenever a “private” image is about to be shared
• Personalization is needed because privacy is subjective!
  - Would you share such an image?
  - Does it depend on whom you would share it with?
Previous Work, and Limitations
• Focus on a generic (“community”) notion of privacy
• Models trained on PicAlert [1]: Flickr images annotated according to a common privacy definition
• Consequences:
  • Variability in user perceptions not captured
  • Over-optimistic performance estimates
  • Justifications are barely comprehensible
[1] Zerr et al., I know what you did last summer!: Privacy-aware image classification and search, CIKM, 2012.
Goals of the Study
• Study personalization in image privacy classification
  • Compare personalized vs. generic models
  • Compare two types of personalized models
• Semantic visual features
  • Better justifications and privacy insights
• YourAlert: more realistic than existing benchmarks
Personalization Approaches
• Full personalization:
  • A different model for each user, relying only on their own feedback
  • Disadvantage: requires a lot of feedback
• Partial personalization:
  • Models rely on user feedback + feedback from other users
  • Amount of personalization controlled via instance weighting
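The instance-weighting idea behind partial personalization might be sketched as follows. Everything here is illustrative: the features and labels are synthetic, and LogisticRegression simply stands in for whatever classifier is used.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Partial personalization via instance weighting: generic examples get
# weight 1, the target user's own examples get a higher weight w, so the
# user's feedback dominates without discarding the generic training data.
rng = np.random.default_rng(0)
X_generic, y_generic = rng.normal(size=(200, 5)), rng.integers(0, 2, 200)
X_user,    y_user    = rng.normal(size=(20, 5)),  rng.integers(0, 2, 20)

w = 2.0  # personalization strength (cf. the 'hybrid w=2' setting later)
X = np.vstack([X_generic, X_user])
y = np.concatenate([y_generic, y_user])
weights = np.concatenate([np.ones(len(y_generic)),
                          np.full(len(y_user), w)])

model = LogisticRegression().fit(X, y, sample_weight=weights)
preds = model.predict(X_user)
```

Setting w = 1 treats user and generic examples equally; increasing w slides the model from generic towards fully personalized.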
Visual and Semantic Features
• vlad [1]: aggregation of local image descriptors
• cnn [2]: deep visual features
• semfeat [3]: outputs of ~17K concept detectors
  • Trained using cnn
  • Top 100 concepts kept per image

[1] Spyromitros-Xioufis et al., A comprehensive study over VLAD and product quantization in large-scale image retrieval. IEEE Transactions on Multimedia, 2014.
[2] Simonyan and Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv, 2014.
[3] Ginsca et al., Large-Scale Image Mining with Flickr Groups, MultiMedia Modeling, 2015.
Explanations via Semfeat
• Semfeat can be used to justify predictions
  • A tag cloud of the most discriminative visual concepts
• Explanations may often be confusing
  • Concept detectors are not perfect
  • The semfeat vocabulary (ImageNet) is not privacy-oriented
[Example tag cloud: knitwear, young-back, hand-glass, cigar-smoker, smoker, drinker, Freudian]
semfeat-LDA: Enhanced Explanations
• Project semfeat to a latent space (second-level semantic representation)
• Images treated as text documents (top 10 concepts)
• Text corpus created from private images (PicAlert + YourAlert)
• LDA is applied to create a topic model (30 topics)
• 6 privacy-related topics are identified (manually)
Topic     | Top 5 semfeat concepts assigned to each topic
children  | dribbler, child, godson, wimp, niece
drinking  | drinker, drunk, tipper, thinker, drunkard
erotic    | slattern, erotic, cover-girl, maillot, back
relatives | great-aunt, second-cousin, grandfather, mother, great-grandchild
vacations | seaside, vacationer, surf-casting, casting, sandbank
wedding   | groom, bride, celebrant, wedding, costume
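The semfeat-LDA construction (images as "documents" of their top concepts, topics learned with LDA) might be sketched like this. The concept documents are invented, and scikit-learn's LatentDirichletAllocation stands in for the topic model used in the study:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Each image becomes a "document" listing its top detected concepts;
# LDA then groups co-occurring concepts into latent topics.
image_docs = [
    "child godson niece child dribbler",
    "drinker drunk tipper drunkard drinker",
    "groom bride wedding celebrant costume",
    "seaside vacationer sandbank surf-casting seaside",
    "child niece godson wimp child",
]

# custom token pattern so hyphenated concepts stay intact
counts = CountVectorizer(token_pattern=r"[^\s]+").fit_transform(image_docs)
lda = LatentDirichletAllocation(n_components=3, random_state=0).fit(counts)
topic_dist = lda.transform(counts)   # 2nd-level semantic representation
print(topic_dist.shape)  # (5, 3)
```

Each row is a probability distribution over topics, which is the "second-level" representation used both for explanations and for the recurring-theme analysis later on.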
semfeat-LDA: Example
[1st-level semantic representation: tag cloud of detected concepts (knitwear, young-back, hand-glass, cigar-smoker, smoker, drinker, Freudian)]
[2nd-level semantic representation: distribution over the LDA topics]
YourAlert: A Realistic Benchmark
• User study
  • Participants annotate their own photos (informed consent; only extracted features are shared)
  • Annotation based on the following definitions:
    • Private: “would share only with close OSN friends or not at all”
    • Public: “would share with all OSN friends or even make public”
• Resulting dataset: YourAlert
  • 1.5K photos, 27 users, ~16 private / 40 public per user
  • Main advantages:
    • Facilitates realistic evaluation of privacy models
    • Allows development of personalized models
Publicly available at: http://mklab.iti.gr/datasets/image-privacy/
Generic Models: PicAlert vs. YourAlert
Key Findings
• Almost perfect performance on PicAlert with cnn
  • semfeat performs similarly to cnn
• Significantly worse performance on YourAlert
  • Similar performance for all features
• Additional findings
  • Using more generic training examples does not help
  • Large variability in performance across users
Personalized Privacy Models
• Evaluation carried out on YourAlert
  • A modified k-fold cross-validation for unbiased estimates
• Personalized model types
  • ‘user’: only user-specific examples from YourAlert
  • ‘hybrid’: a mixture of user-specific examples from YourAlert and generic examples from PicAlert
    • User-specific examples are weighted higher
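The per-user evaluation idea (cross-validation over each user's own photos) can be sketched as follows. The features and labels are synthetic, and scikit-learn's StratifiedKFold stands in for the paper's modified k-fold split:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# For one user: split their photos into 3 folds, train a 'user' model on
# two folds, score on the held-out fold, and average across folds.
rng = np.random.default_rng(2)
X_user = rng.normal(size=(56, 8))        # one user's photo features
y_user = np.array([1] * 16 + [0] * 40)   # ~16 private / 40 public

scores = []
splitter = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
for train_idx, test_idx in splitter.split(X_user, y_user):
    model = LogisticRegression().fit(X_user[train_idx], y_user[train_idx])
    probs = model.predict_proba(X_user[test_idx])[:, 1]
    scores.append(roc_auc_score(y_user[test_idx], probs))
print(len(scores))  # 3
```

A 'hybrid' variant would additionally append PicAlert examples to each training split, with the user-specific examples weighted higher as described above.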
Evaluation of Personalized Models
[Diagram, repeated per model type: for each user (u1, u2, u3), their YourAlert photos are split via 3-fold cross-validation, with k=1 fold held out as the test set; ‘hybrid’ models additionally train on generic PicAlert examples.]
Model types evaluated: ‘user’, ‘hybrid w=1’, ‘hybrid w=2’
Results
Privacy Insights via Semfeat
[Example discriminative concepts: private: child, mate, son; public: uphill, lakefront, waterside]
Identifying Recurring Privacy Themes
• A prototype semfeat-LDA vector for each user
  • The centroid of the semfeat-LDA vectors of their private images
• K-means (k=5) clustering on the prototype vectors
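A hedged sketch of this clustering step, with synthetic topic vectors standing in for the real semfeat-LDA representations:

```python
import numpy as np
from sklearn.cluster import KMeans

# Each user is represented by the centroid (prototype) of the semfeat-LDA
# topic vectors of their private images; users are then clustered with
# k-means to surface recurring privacy themes. All data here is synthetic.
rng = np.random.default_rng(1)
n_users, n_topics = 27, 6
user_image_vectors = [
    rng.dirichlet(np.ones(n_topics), size=int(rng.integers(5, 20)))
    for _ in range(n_users)
]

# prototype vector per user: mean of their private images' topic vectors
prototypes = np.array([v.mean(axis=0) for v in user_image_vectors])

clusters = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(prototypes)
print(clusters.shape)  # (27,)
```

Users landing in the same cluster share a dominant private topic (e.g. "children" or "drinking" from the topic table above).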
Would You Share the Following?
With whom would you share the photos in the following slides?
a) family
b) friends
c) colleagues
d) your Facebook friends
e) everyone (public)
Part V: Future Directions
Towards Private Multimedia Systems
We should:
• Research methods to help mitigate risks and offer choice.
• Develop privacy policies and APIs that take multimedia retrieval into account.
• Educate users and engineers on privacy issues.
...before panic slows progress in the multimedia field.
The Role of Research
Research can help:
• Describe and quantify risk factors
• Visualize and offer choices in UIs
• Identify privacy-breaking information
• Filter out “irrelevant information” through content analysis
Reality Check
Can we build a privacy-proof system?
No. We can’t build a theft-proof car either.
However, we can design a system to be more, or less, resistant to privacy attacks.
Emerging Issue: Internet of Things
Graphic by Applied Materials using International Data Corporation data.
Emerging Issue: Wearables
Source: Amish Gandhi via SlideShare
Multimedia Things
• Much of the data collected in the IoT is multimedia data.
  • Requires (exciting!) new approaches to real-time multimedia content analysis.
  • Presents new threats to security and privacy.
  • Requires new best practices for Security and Privacy by Design, and new privacy-enhancing technologies (PETs).
  • Presents opportunities to work on privacy enhancements to multimedia!
Example IoT Advice From the Future of Privacy Forum
• Get creative with using multimedia affordances (visual, audio, tactile) to alert users to data collection.
• Respect context: users may have different expectations for data they input manually and data collected by sensors.
• Inform users about how their data will be used.
• Choose de-identification practices according to your specific technical situation.
  • In fact, multimedia expertise can contribute to improving de-identification!
• Build trust by allowing users to engage with their own data, and to control who accesses it.
Source: Christopher Wolf, Jules Polonetsky, and Kelsey Finch, A Practical Privacy Paradigm for Wearables. Future of Privacy Forum, 2015.
One Privacy Design Practice Above All
Think about privacy (and security) as you BEGIN designing a system or planning a research program. Privacy is not an add-on!
Describing Risks: A Method from Security Research
• Build a model for potential attacks as a set of:
  • attacker properties
  • attack goals
• Proof your system against it as much as possible.
• Update users’ expectations about residual risk.
Attacker Properties: Individual Privacy
• Resources
  • individual / moderate / institutional resources
• Target model
  • targeted individual / easiest k of N / everyone
• Database access
  • full (private + public) data access / well-indexed access / poorly indexed access / hard retrieval / soft retrieval (multimedia)
Goals of Privacy Attacks
• Cybercasing (attack preparation)
• Cyberstalking
• Socio-economic profiling
• Espionage (industrial, state)
• Cybervetting
• Cyberframing
Towards Privacy-Proof MM Systems
• Match users’ expectations of privacy in system behavior (e.g. include user evaluation)
• If that’s not possible, educate users about the risks
• Ask yourself: what is the best trade-off for users between privacy, utility, and convenience?
• Don’t expose as much information as possible; expose only as much information as is required!
Engineering Rules From the Privacy Community
• Inform users of the privacy model and quantify the possible audience:
  • public / link-to-link / semi-public / private
  • How many people will see the information? (avg. friends-of-friends on Facebook: 70k people!)
• If users expect anonymity, explain the risks of exposure
  • self-posting of PII, hidden metadata, etc.
• Provide tools, based on expert knowledge, that make it easier to stay (more) anonymous (e.g. erase EXIF)
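One such tool, EXIF removal, might look like the following sketch using Pillow. Re-saving an image from its raw pixels drops the metadata block (GPS coordinates, camera model, timestamps); the function name and paths are illustrative:

```python
from PIL import Image

# Hedged sketch of an EXIF-stripping helper: rebuilding the image from its
# pixel data and saving the copy leaves the original EXIF block behind.
def strip_exif(src_path, dst_path):
    with Image.open(src_path) as img:
        clean = Image.new(img.mode, img.size)
        clean.putdata(list(img.getdata()))
        clean.save(dst_path)  # saved without the original EXIF metadata

# usage sketch (hypothetical filenames):
# strip_exif("holiday.jpg", "holiday_clean.jpg")
```

Note that re-encoding is lossy for JPEG; a production tool would rewrite the file's metadata segments instead, but the privacy effect is the same.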
Engineering Rules from the Privacy Community (2)
• Show users what metadata is collected by your service/app and to whom it is made available (AKA a “Privacy Nutrition Label”)
  • At the least, offer an opt-out!
• Make settings easily configurable (Facebook is not easily configurable)
• Offer methods to delete and correct data
  • If possible, trigger search-engine updating after deletion
  • If possible, offer “deep deletion” (i.e. delete re-posts, at least within-system)
Closing Thought Exercise: Part 1
Take two minutes to think about the following questions:
• What’s your area of expertise? What are you working on right now?
• How does it interact with privacy? What are the potential attacks and potential consequences?
• What can you do to mitigate negative privacy effects?
• What can you do to educate users about possible privacy implications?
Closing Thought Exercise: Part 2
• Turn to the person next to you and share your thoughts. Ask each other questions!
• You have five minutes.
Acknowledgments
Work together with:
• Jaeyoung Choi, Luke Gottlieb, Robin Sommer, Howard Lei, Adam Janin, Oana Goga, Nicholas Weaver, Dan Garcia, Blanca Gordo, Serge Egelman, and others
• Georgios Petkos, Eleftherios Spyromitros-Xioufis, Adrian Popescu, Rob Heyman, Georgios Rizos, Polychronis Charitidis, Thomas Theodoridis, and others
Thank You!
Acknowledgements:
• This material is based upon work supported by the US National Science Foundation under Grants No. CNS-1065240 and DGE-1419319, and by the European Commission under Grant No. 611596 for the USEMP project.
• Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the funding bodies.