Privacy & Social Media
Lisa Singh, PhD, Department of Computer Science, Georgetown University
Outline
• Our world on the Internet
• Data privacy in a public profile world
• Methods for determining our web footprints
• Taking control of our web identities
Our presence on the Internet and social media
• 7.2 billion people in the world
• 3.5 billion have a mobile device (50%)
• 3 billion use the Internet (42%)
• 2 billion use social media (29%)
Data, so much data…
Users share 70 billion pieces of content each month on Facebook
190 million tweets are sent per day
65 hours of video are uploaded to YouTube every minute
Image from http://www.playbuzz.com/jaylam10/which-social-media-fits-your-personality
Privacy settings and social media
• 25% of Facebook users do not bother with any privacy settings (velocitydigital.co.uk, 2013)
• 37% of Facebook users have used the site's privacy tools to customize how much information apps are allowed to see (Consumer Reports, 2012)
• 40% of teen Facebook users DO NOT set their Facebook profiles to private (friends only) (Pew Study, 2013)
  – 71% post their school name
  – 71% post the city or town where they live
  – 53% post their email address
  – 20% post their cell phone number
Consequences of Over-sharing
• Identity theft
• Online and physical stalking
• Blackmailing
• Negative employment consequences
• Enabling of snoopers
Data Privacy Expectations
bull We should expect data privacy
bull We should expect freedom from unauthorized use of our data
bull We should expect freedom from data intrusion
How informative, linkable, or sensitive is your public profile – your web footprint?
[Figure: attribute labels surrounding a name ("John Smith" / "Your name"): Gay, Georgetown University, Washington DC, Software Developer, Divorced, Spanish-speaking, Department of Defense, Republican, Catholic]
Lisa Singh, Micah Sherr
Linking data
Facebook
• First Name: Sally
• Last Name: Smith
• Gender: Female
• Location: Georgetown
• Hometown: Pittsburgh
• Favorite Sports Team: Seahawks
• Religion: Atheist
Google+
• First Name: Sally
• Last Name: Smith
• Gender: Female
• Location: Georgetown
• Occupation: Dentist
• Relationship Status: Married
• Zip code: 22033
Adversary's Beliefs
• First Name: Sally
• Last Name: Smith
• Gender: Female
• Location: Georgetown
• Hometown: Pittsburgh
• Occupation: Dentist
• Favorite Sports Team: Redskins
• Religion: Atheist
• Relationship Status: Married
• Zip Code: 22033
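The linking step on these slides can be sketched in a few lines of Python. This is only an illustration: the attribute-values are the ones on the slides, but the linking rule (all of `LINK_KEYS` must agree) and the names `same_person` and `merge_profiles` are assumptions made for the example, not the method used in the cited work.

```python
# Attribute-values are taken from the slides; the linking rule and the
# function names are hypothetical, chosen only for this illustration.
FACEBOOK = {
    "First Name": "Sally", "Last Name": "Smith", "Gender": "Female",
    "Location": "Georgetown", "Hometown": "Pittsburgh",
    "Favorite Sports Team": "Seahawks", "Religion": "Atheist",
}
GOOGLE_PLUS = {
    "First Name": "Sally", "Last Name": "Smith", "Gender": "Female",
    "Location": "Georgetown", "Occupation": "Dentist",
    "Relationship Status": "Married", "Zip code": "22033",
}
LINK_KEYS = ("First Name", "Last Name", "Gender", "Location")


def same_person(a, b, keys=LINK_KEYS):
    """Assume two profiles belong to one person when every linking key agrees."""
    return all(k in a and k in b and a[k] == b[k] for k in keys)


def merge_profiles(a, b):
    """Union of attribute-values; a conflicting value is kept as a set of candidates."""
    merged = dict(a)
    for key, value in b.items():
        if key in merged and merged[key] != value:
            merged[key] = {merged[key], value}
        else:
            merged[key] = value
    return merged


if same_person(FACEBOOK, GOOGLE_PLUS):
    beliefs = merge_profiles(FACEBOOK, GOOGLE_PLUS)
    for attribute, value in beliefs.items():
        print(f"{attribute}: {value}")
```

Neither site alone discloses both the hometown and the occupation; the merged record does, which is why linkable public attributes matter.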
What about friends?
[Figure: the list of names of friends of a starting user on site 1 is compared with the list of names of friends for a given user on site 2]
match = number of overlapping friends between users
[Ramachandran et al. 2012]
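A minimal sketch of the match score on this slide, assuming friends are compared by normalized display name. The names `friend_overlap` and `best_match`, the normalization, and the sample data are hypothetical and are not taken from Ramachandran et al.

```python
def normalize(name):
    """Lower-case and collapse whitespace so trivially different spellings compare equal."""
    return " ".join(name.lower().split())


def friend_overlap(friends_a, friends_b):
    """match = number of friend names that appear in both lists."""
    return len({normalize(n) for n in friends_a} & {normalize(n) for n in friends_b})


def best_match(start_friends, candidates):
    """Return (candidate_id, match) for the candidate account with the largest overlap."""
    scored = ((cid, friend_overlap(start_friends, friends))
              for cid, friends in candidates.items())
    return max(scored, key=lambda pair: pair[1])


# Hypothetical data: one starting user on site 1, two candidate accounts on site 2.
site1_friends = ["Ann Lee", "Bo Chen", "Cy Diaz", "Di Evans"]
site2_candidates = {
    "jdoe_1": ["ann lee", "Cy  Diaz", "Di Evans", "Ed Fox"],
    "jdoe_2": ["Gil Hart", "Bo Chen"],
}
print(best_match(site1_friends, site2_candidates))
```

A real matcher would need a threshold on the score, since common names can overlap by chance.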
Web Footprint
[Figure: three John Doe profiles exposing attributes {A1, A2}, {A3, A4}, and {A5, A6} are linked into one web footprint {A1, A2, A3, A4, A5, A6}]
Really linking data
Shared Public Attributes

Google+
• Company
• Occupation
• Education
• Location
• Birthdate
• Relationship status

• Gender
• Graduation Year

• Company
• Location
• Education
• Email
• Occupation
• Skills
• Industry
• Website
• Languages

FourSquare
• Facebook id
• Twitter handle

• Email
• Gender
• Location
• Phone number

• Relationship status
What do group memberships tell us?
What about tweets?
• A special wish for a special girl HappyBirthday
• I love Starbuck MangoTeaLemonade
• Go Bears
[Singh et al. 2015]

• Birthday
• Gender
• Address
• Education
• Hobbies

• Skills
• Title
• Industry
• Education
• Experience

• Thoughts
• Ideas
• Interests
• Hobbies

To what degree can site-level data be leveraged to determine the undisclosed attributes of a user?
What about the population?
Methodology
bull Sample user profiles from media sites
[Figure 1 diagram: Step 1, Subpopulation Sampling (public profiles); Step 2, Inference Engine Construction (inference model and rules such as ab → c, acd → e, ad → b, bcdf → a); Step 3, Determination of Hidden Attribute-Values (user profile → inference engine → hidden attribute-values)]
Fig. 1. Attribute inference methodology. The adversary samples a random subpopulation of site-level public profiles (left), constructs site-level inference rules and models using the sampled public profiles (center), and applies the inference engine to a targeted user's public profile to predict a hidden attribute-value (right).
[…] the value of attribute $A_j$ when $\mathrm{restricted}(i, A_j) =$ […]). The adversary's goal is to discover information about $i$ that is not in the public profile $P^i$.
To more formally define the attacker, we introduce a truth function $\mathrm{truth}(i)$ that returns a set $\{id, v_{i1}, v_{i2}, \ldots, v_{im}\}$ such that (1) $id = P^i_{ID}$ and (2) either $v_{ij} = P^i_j$ if $P^i_j =$ […], or otherwise $v_{ij}$ is the correct value of attribute $A_j$ for user $i$. Intuitively, $\mathrm{truth}(i)$ is the complete set of correct values for user $i$ for attributes $ID, A_1, \ldots, A_m$, and can contain values that are not in $D$. Hence, the adversary's goal is to infer the set $\mathrm{truth}(i) \setminus P^i$. Note that this includes both attributes that are restricted using the site's permission system and attributes that are unknown (i.e., null) to the site. Toward that end, in this paper we attempt to infer single attribute-values using data available to the adversary.
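As a small worked example of the adversary's target set, the sketch below treats truth(i) and the public profile as dictionaries and computes the attribute-values that are in the former but not the latter. The attribute names follow the paper's toy example; the values and the helper name `hidden_targets` are made up for illustration.

```python
# Hypothetical values; only the set-difference idea comes from the text.
truth_i = {"ID": "user6", "gender": "F", "relationship": "married",
           "industry": "health care"}
public_i = {"ID": "user6", "gender": "F", "relationship": "married"}


def hidden_targets(truth, public):
    """Attribute-values in truth(i) that the public profile does not disclose."""
    return {attr: value for attr, value in truth.items()
            if public.get(attr) != value}


print(hidden_targets(truth_i, public_i))
```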
IV. ATTRIBUTE INFERENCE METHODOLOGY
Even though social networking sites often include privacy settings that allow a user to control which attributes in her profile are disclosed to the public, based on the previous literature presented in Section II, we make the observation that removing sensitive attributes from a public profile is insufficient to ensure that those attributes are not easily discoverable. In this paper, we are interested in understanding how a public attribute or public attribute combination can be used to infer hidden values. Therefore, we analyze how effective frequent patterns of a site's subpopulation are for inferring sensitive attributes that are hidden by a particular user.
We develop an attribute inference methodology for determining non-published attributes about a targeted user. Our methodology is based on the premise that an attacker may explore the site in question and then use this background knowledge to make inferences about a particular user's non-published attributes. Figure 1 shows the three steps in our high-level attribute inference methodology: subpopulation sampling, inference engine construction, and determination of hidden attribute-values.
A. Subpopulation Sampling
The first step of our methodology is to randomly sample a subpopulation of profiles from a site containing a database $D$. More formally, our subpopulation $D'$ has a representative sample of the attribute-value pairs of interest from $D$ (i.e., $D' \subset D$). In practice, an adversary can trivially obtain a subpopulation sample by using a site's web interface or API.
B. Site-based Inference Engine Construction and Determination of Hidden Attribute-Values
There are many methods for building a site-based inference engine. We begin by extending two previously proposed approaches: one that uses Latent Dirichlet Allocation (LDA) [4] and another based on a modified Naïve Bayes method [14]. We then propose a new approach based on multi-attribute association rule mining. Finally, we consider an ensemble approach that incorporates all of the different techniques into the site-level inference engine. Construction of the inference engine is done offline and infrequently for a particular site; therefore, the cost of generating inference models or rules is not significantly burdensome to the adversary.
To clarify the different methods, we will use a toy example based on user data presented in Figure 2. In this example, $D$ contains four attributes: id, gender, relationship, and industry. The adversary is interested in determining User 6's industry attribute-value. In this scenario, User 6 has decided to not make this attribute-value public. Using the site API, the adversary obtains $D'$, a subset of $D$ containing the public profiles for Users 1-5. The adversary will now generate his inference engine using these public profiles. The remainder of this subsection describes each of the methods that can serve as the basis for the inference engine that the adversary will build.
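To make the toy scenario concrete, here is a sketch of one possible inference engine: a plain categorical Naïve Bayes classifier with Laplace smoothing. It is not the modified Naïve Bayes method the paper extends, and because Figure 2 is not reproduced here, the five sampled profiles and the name `naive_bayes_infer` are hypothetical stand-ins.

```python
from collections import Counter, defaultdict

# Hypothetical stand-in for the sampled public profiles of Users 1-5.
SAMPLE = [
    {"id": 1, "gender": "F", "relationship": "married", "industry": "health care"},
    {"id": 2, "gender": "F", "relationship": "married", "industry": "health care"},
    {"id": 3, "gender": "M", "relationship": "single", "industry": "software"},
    {"id": 4, "gender": "M", "relationship": "married", "industry": "software"},
    {"id": 5, "gender": "F", "relationship": "single", "industry": "software"},
]
# User 6 publishes gender and relationship but hides industry.
TARGET = {"id": 6, "gender": "F", "relationship": "married"}


def naive_bayes_infer(sample, target, hidden, alpha=1.0):
    """Return (best value, scores) for the hidden attribute of the target profile."""
    features = [a for a in target if a not in ("id", hidden)]
    class_counts = Counter(p[hidden] for p in sample)
    value_counts = defaultdict(Counter)  # (attribute, class) -> value -> count
    domains = defaultdict(set)           # attribute -> observed values
    for p in sample:
        for a in features:
            value_counts[(a, p[hidden])][p[a]] += 1
            domains[a].add(p[a])
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / len(sample)  # prior P(c)
        for a in features:
            # Laplace-smoothed likelihood P(a = target value | c)
            count = value_counts[(a, c)][target[a]]
            score *= (count + alpha) / (n_c + alpha * len(domains[a]))
        scores[c] = score
    return max(scores, key=scores.get), scores


guess, scores = naive_bayes_infer(SAMPLE, TARGET, hidden="industry")
print(guess, scores)
```

With this made-up sample, the public gender and relationship values are enough to rank one industry above the other, which is the sense in which hiding a single attribute does not protect it.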
LDA Nearest Neighbor Inference. Chaabane et al. use the Latent Dirichlet Allocation (LDA) generative model to extract semantic links between interest attribute-values [3]. Our variation of their method is as follows.
Each profile $\{id_k, v_{k1}, v_{k2}, \ldots, v_{kq}\}$ in the subpopulation $D'$ consists of the attribute-value pairs for some subset of attributes in $D$. We begin by considering a particular attribute $A_q$. Each attribute has a domain containing a set of values $|A_q| = \{v_1, \ldots, v_m\}$. In LDA terms, we consider each attribute-value a word. For each attribute-value $v_k$, we obtain its related Wikipedia categories to enhance the value sets. We first retrieve the top relevant article describing the attribute $v_k$ from a free text index built by the Lemur Search Engine² over the entire Wikipedia stub contained in the ClueWeb09 collection³. The index's size is approximately 1GB for the compressed documents. Next, from each of these articles, we use Wikipedia as an ontology and obtain all the categories and general categories for the top $n$ articles using a toolkit developed by [10]. For instance, a value “someone like you” has top Wikipedia categories “Adele (singer) songs” and “Singles certified septuple platinum by the Australian Recording Industry Association”. These categories help to create the hidden topical structure that will be inferred using the observed attribute-values. Intuitively, the distribution of […]
² http://www.lemurproject.org
³ http://lemurproject.org/clueweb09
• Use user profiles to construct an inference engine containing a set of inference rules
• Make inferences using the inference engine
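The rule-based variant of the methodology above can be sketched as follows. This is a brute-force enumeration of attribute-value combinations scored by support and confidence, not the multi-attribute association rule mining algorithm from the cited paper; the sample profiles, thresholds, and the names `mine_rules` and `infer` are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical sampled public profiles.
PROFILES = [
    {"id": 1, "gender": "F", "relationship": "married", "industry": "health care"},
    {"id": 2, "gender": "F", "relationship": "married", "industry": "health care"},
    {"id": 3, "gender": "M", "relationship": "single", "industry": "software"},
    {"id": 4, "gender": "M", "relationship": "married", "industry": "software"},
    {"id": 5, "gender": "F", "relationship": "single", "industry": "software"},
]


def mine_rules(profiles, hidden, min_support=2, min_confidence=0.6):
    """Mine rules (antecedent -> hidden value) with their support and confidence."""
    antecedent_counts = Counter()
    rule_counts = Counter()
    for p in profiles:
        items = sorted((a, v) for a, v in p.items() if a not in ("id", hidden))
        for size in range(1, len(items) + 1):
            for antecedent in combinations(items, size):
                antecedent_counts[antecedent] += 1
                rule_counts[(antecedent, p[hidden])] += 1
    rules = []
    for (antecedent, value), support in rule_counts.items():
        confidence = support / antecedent_counts[antecedent]
        if support >= min_support and confidence >= min_confidence:
            rules.append((antecedent, value, support, confidence))
    # Most confident rules first; among ties, prefer the more specific antecedent.
    return sorted(rules, key=lambda r: (r[3], len(r[0])), reverse=True)


def infer(rules, public_profile):
    """Apply the first rule whose antecedent is fully present in the public profile."""
    for antecedent, value, _, _ in rules:
        if all(public_profile.get(a) == v for a, v in antecedent):
            return value
    return None


rules = mine_rules(PROFILES, hidden="industry")
print(infer(rules, {"gender": "F", "relationship": "married"}))
```

Enumerating every attribute combination is exponential in the number of attributes, so a real engine would prune with a frequent-itemset algorithm; the brute-force version is used here only because the toy profiles have two public attributes.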
What can be inferred from the population?
[Figure: chart of inference gain; axis ticks from 0 to 15]
LinkedIn dataset: 91,150 public profiles, 12 attributes per profile
[Moore et al. 2013]
Web Footprinting
Experiments for Understanding Public Profiles
About.me - personal website hosting site. Each user can make a custom webpage about themselves and can list links to their social media profiles on multiple websites.
Using their API, we collected 124,497 people's information -> Ground Truth
Creating Web Footprints Using Google+, Foursquare, and LinkedIn Profiles
[Singh et al. 2015]
Synonyms can be found
• DBpedia
• Synonyms
• Meronym
Using an Ontology
Approximately 8,000 attributes were matched up from the ontology
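A minimal sketch of ontology-assisted matching, assuming the goal is to decide whether two attribute-values from different sites refer to the same thing. The tiny `SYNONYMS` and `PART_OF` tables are hand-written placeholders for what an ontology such as DBpedia could supply, and the function names are hypothetical.

```python
# Hypothetical hand-written ontology fragment; a real system would look these up.
SYNONYMS = {"dc": "washington dc", "washington, d.c.": "washington dc"}
PART_OF = {"georgetown": "washington dc"}  # meronym: part -> whole


def canonical(value):
    """Normalize case and whitespace, then map known synonyms to one canonical form."""
    normalized = " ".join(value.lower().split())
    return SYNONYMS.get(normalized, normalized)


def values_match(a, b):
    """True when two attribute-values are equal, synonyms, or in a part/whole relation."""
    ca, cb = canonical(a), canonical(b)
    return ca == cb or PART_OF.get(ca) == cb or PART_OF.get(cb) == ca


print(values_match("DC", "Washington DC"))
print(values_match("Georgetown", "Washington, D.C."))
print(values_match("Georgetown", "Pittsburgh"))
```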
Taking Control of Our Web Identity and Data
1. Keep your public profile professional
2. Change all your social media account settings that have personal information on them from public to private
3. Choose your friends wisely – add them selectively
4. Join groups related to your professional interests
5. Make it difficult for automated tools to link your accounts, e.g., use different account user names, share different information, etc.
6. Install ad blockers to reduce data about your click-through habits
7. Set your browser to not accept cookies from sites that you have not visited before
The world around us
DATAFICATION
Data Ethics
• Regulation
  – We need to hold companies to higher standards
• Data ethics standards
  – We need discussion, debate, and possibly a new discipline
• Catalog of personal data
  – Individuals should be able to see, correct, and/or remove data companies have about them
[Singh 2016]
Final Thoughts
• There is a cultural acceptance of sharing private data publicly
• This is a problem - I have shown you different techniques for generating web footprints. It is too easy.
• We need new ways to help users understand what data can be determined about them and help them take control of their information
• We need to pause and debate online privacy and ethical uses of large-scale human behavioral data
• We need to develop guidelines and regulations that protect users
We need to take back control of our data
References
J. Zhu, S. Zhang, L. Singh, H. Yang, and M. Sherr. Generating Risk Reduction Recommendations to Decrease Identifiability of Public Online Profiles. Under submission.
A. Hian-Cheong, L. Singh, M. Sherr, and H. Yang. Semantics and Public Information Exposure Detection. Invited.
L. Singh, H. Yang, M. Sherr, A. Hian-Cheong, K. Tian, J. Zhu, and S. Zhang. Public Information Exposure Detection: Helping Users Understand Their Web Footprints. International Conference on Advances in Social Networks Analysis and Mining (ASONAM), Paris, France, IEEE/ACM, 2015.
L. Singh, H. Yang, M. Sherr, Y. Wei, A. Hian-Cheong, K. Tian, J. Zhu, S. Zhang, T. Vaidya, and E. Asgarli. Helping Users Understand Their Web Footprints. International Conference on World Wide Web - Companion Proceedings, World Wide Web (WWW), Florence, Italy, Poster Paper, 2015.
W. B. Moore, Y. Wei, A. Orshefsky, M. Sherr, L. Singh, and H. Yang. Understanding Site-Based Inference Potential for Identifying Hidden Attributes. International Conference on Privacy, Security, Risk and Trust, Alexandria, VA, IEEE Computer Society, 2013.
J. Ferro, L. Singh, and M. Sherr. Identifying individual vulnerability based on public data. International Conference on Privacy, Security and Trust, Tarragona, Catalonia, Spain, IEEE Computer Society, 2013.
F. Nagle, L. Singh, and A. Gkoulalas-Divanis. EWNI: Efficient Anonymization of Vulnerable Individuals in Social Networks. Pacific Asian Conference on Knowledge Discovery and Data Mining (PAKDD), Kuala Lumpur, Malaysia, Springer, 2012.
A. Ramachandran, L. Singh, E. Porter, and F. Nagle. Exploring re-identification risks in public domains. Conference on Privacy, Security and Trust (PST), IEEE Computer Society, 2012.
The Team & Support
• Faculty – Lisa Singh, Micah Sherr, Grace Hui Yang
• Students & Researchers – Rob Churchill, Kristen Skillman, Kevin Tian, Sicong Zhang, Yanan Zhu
• Alumni – Aditi Ramachandran, Frank Nagle, John Ferro, Yifang Wei, Brad Moore, Andrew Hian-Cheong, Janet Zhu
Support: National Science Foundation
5 Reasons to Join Our Program
1. We are research active and provide full financial RA support for PhD students for 5 years
2. We have full and partial scholarships for Master's students
3. Our courses span applied and theoretical areas of computer science as well as interdisciplinary areas like data science
4. We have exceptional job placement in top tech firms, national labs, and government agencies
5. We have a strong community among students and faculty
Need more information?
Website: http://cs.georgetown.edu
Graduate Director: Lisa Singh (singh@cs.georgetown.edu)
Application deadline: April 1
Outline
bull Our world on the Internet bull Data privacy in a public profile world bull Methods for determining our web footprints
bull Taking control of our web identities
Our presence on the Internet and social media
72 Billion People in the World
35 Billion Have a Mobile
Device
50
3 Billion Use the Internet 42
2 Billion Use Social Media
29
Data so much datahellip
Users share 70 billion pieces of content each month on Facebook
190 million tweets are sent per day
65 hours of video are uploaded to YouTube every minute
Image from httpwwwpl aybuzzcomjaylam10 which-social-media-fits-your-personality
Privacy settings and social media
bull 25 of Facebook users do not bother with any privacy settings (velocitydigitalcouk 2013)
bull 37 of Facebook users have used the sitersquos privacy toolsto customize how much information apps are allowed tosee (Consumer reports 2012)
bull 40 of teen Facebook users DO NOT set their Facebook profiles to private (friends only) (Pew Study 2013) ndash 71 post their school name ndash 71 post the city or town where they live ndash 53 post their email address ndash 20 post their cell phone number
Consequences of Over-sharing
bull Identity theft bull Online and physical stalking bull Blackmailing bull Negative employment consequences
bull Enabling of snoopers
Data Privacy Expectations
bull We should expect data privacy
bull We should expect freedom from unauthorized use of our data
bull We should expectfreedom from data intrusion
How informative linkable or sensitive is your public profile ndash your web footprint
Gay
Georgetown University
Washington DC
Software Developer
John Smith
John Smith
Divorced
Spanish-speaking
Department of Defense
Republican
Catholic
Your name
Lisa Singh Micah Sherr
Linking dataFacebook
First Name SallyLast Name SmithGender FemaleLocation GeorgetownHometown PittsburghFavorite Sports Team SeahawksReligion Atheist
Google+
First Name SallyLast Name SmithGender FemaleLocation Georgetown Occupation DentistRelationship Status MarriedZip code 22033
Linking dataFacebook
First Name SallyLast Name SmithGender FemaleLocation GeorgetownHometown PittsburghFavorite Sports Team SeahawksReligion Atheist
Google+
First Name SallyLast Name SmithGender FemaleLocation Georgetown Occupation DentistRelationship Status MarriedZip code 22033
Adversaryrsquos Beliefs
First Name SallyLast Name SmithGender FemaleLocation GeorgetownHometown PittsburghOccupation DentistFavorite Sports Team RedskinsReligion AtheistRelationship Status MarriedZip Code 22033
What about friends
Startinguser
Listofnamesoffriends
Listofnamesoffriendsforgiven
user
match =numberoverlappingfriendsbetweenusers
site1 site2
[Ramachandran etal2012]
WebFootprint
A1A2A3A4A5A6
A1A2
JohnDoe
A3A4
JohnDoe
A5A6
JohnDoe
Really linking data
Shared Public Attributes
Google+
bull Companybull Occupationbull Educationbull Locationbull Birthdatebull Relationshipstatus
bull Genderbull GraduationYear
bull Companybull Locationbull Educationbull Emailbull Occupationbull Skillsbull Industrybull Websitebull Languages
FourSquare
bull Facebookidbull Twitterhandle
bull Emailbull Genderbull Locationbull Phonenumber
bull Relationshipstatus
What do group memberships tell us
What about tweets
bull A special wish for a special girl HappyBirthdaybull I love Starbuck MangoTeaLemonadebull Go Bears
[Singhetal2015]
bull Birthdaybull Genderbull Addressbull Educationbull Hobbies
bull Skillsbull Titlebull Industrybull Educationbull Experience
bull Thoughtsbull Ideasbull Interestsbull HobbiesTowhatdegreecansiteleveldatabe
leveragedtodeterminetheundisclosedattributesofauser
What about the population
Methodology
bull Sample user profiles from media sites
ab rarr cacd rarr ead rarr b
bcdf rarr a
Public Profiles
Step 1Subpopulation
Sampling
Step 2Inference Engine
Construction
UserProfile
InferenceModel
HiddenAttribute-Values
Step 3Determination of Hidden
Attribute-Values
InferenceEngine
InferenceModel
Fig 1 Attribute inference methodology The adversary samples arandom subpopulation of site-level public profiles (left) constructssite-level inference rules and models using the sampled public profiles(center) and applies the inference engine to a targeted userrsquos publicprofile to predict a hidden attribute-value (right)
the value of attribute Aj when restricted(i Aj) = ) Theadversaryrsquos goal is to discover information about i which isnot in the public profile P i
To more formally define the attacker we introduce a truthfunction truth(i) that returns a set id vi1 vi2 vim suchthat (1) id = Pi
ID and (2) either vij = Pij if Pi
j = orotherwise vij is the correct value of attribute Aj for user iIntuitively truth(i) is the complete set of correct values foruser i for attributes ID A1 Am and can contain valuesthat are not in D Hence the adversaryrsquos goal is to infer theset truth(i) P i Note that this includes both attributes thatare restricted using the sitersquos permission system as well asattributes that are unknown (ie null) to the site Toward thatend in this paper we attempt to infer single attribute-valuesusing data available to the adversary
IV ATTRIBUTE INFERENCE METHODOLOGY
Even though social networking sites often include privacysettings that allow a user to control which attributes in her pro-file are disclosed to the public based on the previous literaturepresented in Section II we make the observation that removingsensitive attributes from a public profile is insufficient to ensurethat those attributes are not easily discoverable In this paperwe are interested in understanding how a public attribute orpublic attribute combination can be used to infer hidden valuesTherefore we analyze how effective frequent patterns of asitersquos subpopulation are for inferring sensitive attributes thatare hidden by a particular user
We develop an attribute inference methodology for deter-mining non-published attributes about a targeted user Ourmethodology is based on the premise that an attacker mayexplore the site in question and then use this backgroundknowledge to make inferences about a particular userrsquos non-published attributes Figure 1 shows the three steps in our highlevel attribute inference methodology subpopulation samplinginference engine construction and determination of hiddenattribute-values
A. Subpopulation Sampling
The first step of our methodology is to randomly sample a subpopulation of profiles from a site containing a database $D$. More formally, our subpopulation $D'$ has a representative sample of the attribute-value pairs of interest from $D$ (i.e., $D' \subseteq D$). In practice, an adversary can trivially obtain a subpopulation sample by using a site's web interface or API.
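The sampling step could look like the following sketch. It assumes the adversary has already fetched a list of public profiles; the function name `sample_subpopulation`, the `seed` parameter, and the placeholder profiles are illustrative assumptions, not part of the paper.

```python
import random
from typing import Dict, List, Optional


def sample_subpopulation(
    profiles: List[Dict], k: int, seed: Optional[int] = None
) -> List[Dict]:
    """Draw a uniform random sample D' of at most k profiles from D."""
    rng = random.Random(seed)      # a private generator keeps the draw reproducible
    k = min(k, len(profiles))      # random.sample raises ValueError if k > len
    return rng.sample(profiles, k)  # sampling without replacement


# Placeholder database D: 100 hypothetical profiles that carry only an id.
D = [{"id": i} for i in range(100)]
D_prime = sample_subpopulation(D, 10, seed=0)
print(len(D_prime))
```

Sampling without replacement keeps every profile in the sample distinct, so the sample is a subset of the database.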
B. Site-based Inference Engine Construction and Determination of Hidden Attribute-Values
There are many methods for building a site-based inference engine. We begin by extending two previously proposed approaches: one that uses Latent Dirichlet Allocation (LDA) [4] and another based on a modified Naïve Bayes method [14]. We then propose a new approach based on multi-attribute association rule mining. Finally, we consider an ensemble approach that incorporates all of the different techniques into the site-level inference engine. Construction of the inference engine is done offline and infrequently for a particular site; therefore, the cost of generating inference models or rules is not significantly burdensome to the adversary.
To clarify the different methods, we will use a toy example based on user data presented in Figure 2. In this example, $D$ contains four attributes: id, gender, relationship, and industry. The adversary is interested in determining User 6's industry attribute-value. In this scenario, User 6 has decided to not make this attribute-value public. Using the site API, the adversary obtains $D'$, a subset of $D$ containing the public profiles for Users 1-5. The adversary will now generate his inference engine using these public profiles. The remainder of this subsection describes each of the methods that can serve as the basis for the inference engine that the adversary will build.
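To make the toy scenario concrete, here is a small sketch of rule-based inference in the spirit of multi-attribute association rule mining. Figure 2 is not reproduced in this text, so the five profiles below are invented for illustration, and the thresholds `min_support` and `min_conf` as well as the function names are assumptions rather than the paper's settings.

```python
from collections import Counter
from itertools import combinations


def mine_rules(profiles, target, min_support=2, min_conf=0.6):
    """Mine rules {public attribute-values} -> target value from D'."""
    ante_counts, rule_counts = Counter(), Counter()
    for p in profiles:
        if p.get(target) is None:
            continue  # a profile that hides the target cannot support a rule
        items = sorted((a, v) for a, v in p.items() if a != target and v is not None)
        for r in range(1, len(items) + 1):
            for ante in combinations(items, r):
                ante_counts[ante] += 1
                rule_counts[(ante, p[target])] += 1
    rules = []
    for (ante, value), n in rule_counts.items():
        conf = n / ante_counts[ante]
        if n >= min_support and conf >= min_conf:
            rules.append((ante, value, n, conf))
    return rules


def infer(rules, public_profile):
    """Apply the matching rule with the highest confidence, if any."""
    items = {(a, v) for a, v in public_profile.items() if v is not None}
    matching = [r for r in rules if set(r[0]) <= items]
    if not matching:
        return None
    # Prefer higher confidence, then longer antecedents, then higher support.
    return max(matching, key=lambda r: (r[3], len(r[0]), r[2]))[1]


# Hypothetical stand-in for Users 1-5 (NOT the data from Figure 2).
D_prime = [
    {"gender": "F", "relationship": "married", "industry": "education"},
    {"gender": "F", "relationship": "married", "industry": "education"},
    {"gender": "M", "relationship": "single", "industry": "software"},
    {"gender": "M", "relationship": "single", "industry": "software"},
    {"gender": "F", "relationship": "single", "industry": "software"},
]
user6 = {"gender": "M", "relationship": "single"}  # industry is hidden

rules = mine_rules(D_prime, target="industry")
prediction = infer(rules, user6)
print(prediction)
```

With this invented data, every rule that matches User 6 points to the same industry value, so the tie-breaking order does not change the prediction.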
LDA Nearest Neighbor Inference. Chaabane et al. use the Latent Dirichlet Allocation (LDA) generative model to extract semantic links between interest attribute-values [3]. Our variation of their method is as follows.
Each profile $\{id_k, v_{k1}, v_{k2}, \ldots, v_{kq}\}$ in the subpopulation $D'$ consists of the attribute-value pairs for some subset of attributes in $D$. We begin by considering a particular attribute $A_q$. Each attribute has a domain containing a set of values $|A_q| = \{v_1, \ldots, v_m\}$. In LDA terms, we consider each attribute-value a word. For each attribute-value $v_k$, we obtain its related Wikipedia categories to enhance the value sets. We first retrieve the top relevant article describing the attribute $v_k$ from a free text index built by the Lemur Search Engine² over the entire Wikipedia stub contained in the ClueWeb09 collection³. The index's size is approximately 1 GB for the compressed documents. Next, from each of these articles, we use Wikipedia as an ontology and obtain all the categories and general categories for the top $n$ articles using a toolkit developed by [10]. For instance, a value "someone like you" has top Wikipedia categories "Adele (singer) songs" and "Singles certified septuple platinum by the Australian Recording Industry Association". These categories help to create the hidden topical structure that will be inferred using the observed attribute-values. Intuitively, the distribution of
² http://www.lemurproject.org
³ http://lemurproject.org/clueweb09
[Figure 1 diagram. Panels: Public Profiles; Step 1, Subpopulation Sampling; Step 2, Inference Engine Construction; Step 3, Determination of Hidden Attribute-Values. Example rules shown in the inference engine: ab → c, acd → e, ad → b, bcdf → a.]
Fig. 1. Attribute inference methodology. The adversary samples a random subpopulation of site-level public profiles (left), constructs site-level inference rules and models using the sampled public profiles (center), and applies the inference engine to a targeted user's public profile to predict a hidden attribute-value (right).
• Use user profiles to construct an inference engine containing a set of inference rules
• Make inferences using the inference engine
What can be inferred from the population
[Chart: inference gain, axis from 0 to 15]
[Moore et al. 2013]
LinkedIn dataset: 91,150 public profiles, 12 attributes per profile
Web Footprinting
Experiments for Understanding Public Profiles
About.me - personal website hosting site. Each user can make a custom webpage about themselves and can list links to their social media profiles on multiple websites.
Using their API, we collected 124,497 people's information -> Ground Truth
Creating Web Footprints Using Google+ Foursquare LinkedIn Profiles
[Singh et al. 2015]
Synonyms can be found
DBpedia
Synonyms
Meronym
Using an Ontology
Approximately 8,000 attributes were matched up from the ontology
Taking Control of Our Web Identity and Data
1. Keep your public profile professional
2. Change all your social media account settings that have personal information on them from public to private
3. Choose your friends wisely – add them selectively
4. Join groups related to your professional interests
5. Make it difficult for automated tools to link your accounts, e.g., use different account user names, share different information, etc.
6. Install ad blockers to reduce data about your click-through habits
7. Set your browser to not accept cookies from sites that you have not visited before
The world around us
DATAFICATION
Data Ethics
• Regulation – We need to hold companies to higher standards
• Data ethics standards – We need discussion, debate, and possibly a new discipline
• Catalog of personal data – Individuals should be able to see, correct, and/or remove data companies have about them
[Singh 2016]
Final Thoughts
• There is a cultural acceptance of sharing private data publicly
• This is a problem - I have shown you different techniques for generating web footprints. It is too easy
• We need new ways to help users understand what data can be determined about them and help them take control of their information
• We need to pause and debate online privacy and ethical uses of large-scale human behavioral data
• We need to develop guidelines and regulations that protect users
We need to take back control of our data
References
J. Zhu, S. Zhang, L. Singh, H. Yang, and M. Sherr. Generating Risk Reduction Recommendations to Decrease Identifiability of Public Online Profiles. Under submission.
A. Hian-Cheong, L. Singh, M. Sherr, and H. Yang. Semantics and Public Information Exposure Detection. Invited.
L. Singh, H. Yang, M. Sherr, A. Hian-Cheong, K. Tian, J. Zhu, and S. Zhang. Public Information Exposure Detection: Helping Users Understand Their Web Footprints. International Conference on Advances in Social Networks Analysis and Mining (ASONAM), Paris, France. IEEE/ACM, 2015.
L. Singh, H. Yang, M. Sherr, Y. Wei, A. Hian-Cheong, K. Tian, J. Zhu, S. Zhang, T. Vaidya, and E. Asgarli. Helping Users Understand Their Web Footprints. International Conference on World Wide Web - Companion Proceedings, World Wide Web (WWW), Florence, Italy. Poster paper, 2015.
W. B. Moore, Y. Wei, A. Orshefsky, M. Sherr, L. Singh, and H. Yang. Understanding Site-Based Inference Potential for Identifying Hidden Attributes. International Conference on Privacy, Security, Risk and Trust, Alexandria, VA. IEEE Computer Society, 2013.
J. Ferro, L. Singh, and M. Sherr. Identifying individual vulnerability based on public data. International Conference on Privacy, Security and Trust, Tarragona, Catalonia, Spain. IEEE Computer Society, 2013.
F. Nagle, L. Singh, and A. Gkoulalas-Divanis. EWNI: Efficient Anonymization of Vulnerable Individuals in Social Networks. Pacific Asian Conference on Knowledge Discovery and Data Mining (PAKDD), Kuala Lumpur, Malaysia. Springer, 2012.
A. Ramachandran, L. Singh, E. Porter, and F. Nagle. Exploring re-identification risks in public domains. Conference on Privacy, Security and Trust (PST). IEEE Computer Society, 2012.
The Team & Support
• Faculty – Lisa Singh, Micah Sherr, Grace Hui Yang
• Students & Researchers – Rob Churchill, Kristen Skillman, Kevin Tian, Sicong Zhang, Yanan Zhu
• Alumni – Aditi Ramachandran, Frank Nagle, John Ferro, Yifang Wei, Brad Moore, Andrew Hian-Cheong, Janet Zhu
Support: National Science Foundation
5 Reasons to Join Our Program
1. We are research active and provide full financial RA support for PhD students for 5 years
2. We have full and partial scholarships for Master's students
3. Our courses span applied and theoretical areas of computer science as well as interdisciplinary areas like data science
4. We have exceptional job placement in top tech firms, national labs, and government agencies
5. We have a strong community among students and faculty
Need more information?
Website: http://cs.georgetown.edu
Graduate Director: Lisa Singh (singh@cs.georgetown.edu)
Application deadline: April 1
Adversary's Beliefs
First Name: Sally
Last Name: Smith
Gender: Female
Location: Georgetown
Hometown: Pittsburgh
Occupation: Dentist
Favorite Sports Team: Redskins
Religion: Atheist
Relationship Status: Married
Zip Code: 22033
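The linking step behind such a belief set can be sketched in Python. The two dictionaries reuse the Facebook and Google+ values for Sally Smith from the linking slide; the functions `link_score` and `merge_profiles` are hypothetical helpers written for this illustration, not tools from the talk.

```python
def link_score(a, b):
    """Count agreeing values among the attributes both profiles publish."""
    shared = [k for k in a if k in b]
    agree = sum(1 for k in shared if a[k] == b[k])
    return agree, len(shared)


def merge_profiles(a, b):
    """Union two linked profiles and report attributes whose values clash."""
    merged, conflicts = dict(a), {}
    for key, value in b.items():
        if key in merged and merged[key] != value:
            conflicts[key] = (merged[key], value)  # keep the first value, record the clash
        else:
            merged[key] = value
    return merged, conflicts


facebook = {
    "First Name": "Sally", "Last Name": "Smith", "Gender": "Female",
    "Location": "Georgetown", "Hometown": "Pittsburgh",
    "Favorite Sports Team": "Seahawks", "Religion": "Atheist",
}
google_plus = {
    "First Name": "Sally", "Last Name": "Smith", "Gender": "Female",
    "Location": "Georgetown", "Occupation": "Dentist",
    "Relationship Status": "Married", "Zip code": "22033",
}

score = link_score(facebook, google_plus)
merged, conflicts = merge_profiles(facebook, google_plus)
print(score, len(merged), conflicts)
```

All four attributes that both sites publish agree, which is what lets an adversary treat the two accounts as one person. This merge only copies published values, so it does not model any belief the adversary forms by other means.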
What about friends
Starting user
List of names of friends
List of names of friends for given user
match = number of overlapping friends between users
site1, site2
[Ramachandran et al. 2012]
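The match score on this slide can be sketched as a set intersection over friend names. This is a minimal illustration: the name normalization, the function names, and all of the sample names are assumptions, not details from the cited work.

```python
def friend_overlap(friends_a, friends_b):
    """match = number of overlapping friend names between two users."""
    def normalize(names):
        return {name.strip().lower() for name in names}
    return len(normalize(friends_a) & normalize(friends_b))


def best_match(start_friends, candidates):
    """Pick the candidate on site2 whose friend list overlaps the most."""
    return max(candidates, key=lambda user: friend_overlap(start_friends, candidates[user]))


# Hypothetical friend lists: the starting user on site1, candidates on site2.
site1_friends = ["Ann Lee", "Bo Chen", "Cara Diaz", "Dev Patel"]
site2_candidates = {
    "jdoe_1": ["Ann Lee", "Eli Stone"],
    "jdoe_2": ["ann lee", "Bo Chen", "Cara Diaz"],
    "jdoe_3": ["Fay Wong"],
}

winner = best_match(site1_friends, site2_candidates)
print(winner, friend_overlap(site1_friends, site2_candidates[winner]))
```

Exact name matching is a simplification; real friend lists would likely need fuzzier comparison.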
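Web Footprint
[Diagram: three John Doe profiles, carrying attributes A1 A2, A3 A4, and A5 A6, are combined into one web footprint with attributes A1 A2 A3 A4 A5 A6]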
Really linking data
Shared Public Attributes
Google+
• Company
• Occupation
• Education
• Location
• Birthdate
• Relationship status
• Gender
• Graduation Year

• Company
• Location
• Education
• Email
• Occupation
• Skills
• Industry
• Website
• Languages
FourSquare
• Facebook id
• Twitter handle
• Email
• Gender
• Location
• Phone number
• Relationship status
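Which attributes can act as join keys across sites follows directly from these lists. The sketch below uses the three attribute lists from this slide; the second list has no site label in the text, so the key `unlabeled_site` is a placeholder, and the variable names are assumptions.

```python
from collections import Counter

# Public attributes per site, lowercased. The middle list carries no
# site label in the slide text, so its key is only a placeholder.
site_attributes = {
    "google_plus": {"company", "occupation", "education", "location",
                    "birthdate", "relationship status", "gender", "graduation year"},
    "unlabeled_site": {"company", "location", "education", "email", "occupation",
                       "skills", "industry", "website", "languages"},
    "foursquare": {"facebook id", "twitter handle", "email", "gender",
                   "location", "phone number", "relationship status"},
}

# Attributes exposed on every site are the most widely usable link keys.
shared_by_all = set.intersection(*site_attributes.values())

# Attributes exposed on at least two sites can still link a pair of accounts.
counts = Counter(attr for attrs in site_attributes.values() for attr in attrs)
shared_by_two_or_more = {attr for attr, n in counts.items() if n >= 2}

print(sorted(shared_by_all))
print(sorted(shared_by_two_or_more))
```

Only one attribute appears on all three lists, while several more are shared by a pair of sites.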
What do group memberships tell us
What about tweets
• A special wish for a special girl HappyBirthday
• I love Starbuck MangoTeaLemonade
• Go Bears
[Singh et al. 2015]
• Birthday
• Gender
• Address
• Education
• Hobbies

• Skills
• Title
• Industry
• Education
• Experience

• Thoughts
• Ideas
• Interests
• Hobbies
To what degree can site-level data be leveraged to determine the undisclosed attributes of a user?
What about the population
Methodology
• Sample user profiles from media sites
2httpwwwlemurprojectorg3httplemurprojectorgclueweb09
[Figure 1 diagram: public profiles feed Step 1, Subpopulation Sampling; Step 2, Inference Engine Construction, produces an inference model with example rules ab → c, acd → e, ad → b, bcdf → a; Step 3, Determination of Hidden Attribute-Values, applies the inference engine to a user profile to output hidden attribute-values.]
Fig. 1. Attribute inference methodology. The adversary samples a random subpopulation of site-level public profiles (left), constructs site-level inference rules and models using the sampled public profiles (center), and applies the inference engine to a targeted user's public profile to predict a hidden attribute-value (right).
the value of attribute $A_j$ when restricted(i, $A_j$) = ). The adversary's goal is to discover information about i which is not in the public profile $P^i$.
To more formally define the attacker, we introduce a truth function truth(i) that returns a set $\{id, v^i_1, v^i_2, \ldots, v^i_m\}$ such that (1) $id = P^i_{ID}$ and (2) either $v^i_j = P^i_j$ if $P^i_j$ is non-null, or otherwise $v^i_j$ is the correct value of attribute $A_j$ for user i. Intuitively, truth(i) is the complete set of correct values for user i for attributes ID, $A_1, \ldots, A_m$, and can contain values that are not in D. Hence, the adversary's goal is to infer the set $truth(i) \setminus P^i$. Note that this includes both attributes that are restricted using the site's permission system and attributes that are unknown (i.e., null) to the site. Toward that end, in this paper we attempt to infer single attribute-values using data available to the adversary.
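The adversary's target set can be illustrated with a small sketch. It assumes profiles are dictionaries in which a restricted or unknown attribute is stored as `None`; the function name `inference_targets` and the sample values are hypothetical and are not taken from the paper.

```python
from typing import Dict, Optional

Profile = Dict[str, Optional[str]]

def inference_targets(truth: Profile, public: Profile) -> Dict[str, str]:
    """Return the attribute-values in truth(i) that are missing from the public profile."""
    return {
        attr: value
        for attr, value in truth.items()
        if value is not None and public.get(attr) is None
    }

# Hypothetical user: the industry attribute is hidden from the public profile.
truth_i = {"id": "u6", "gender": "F", "relationship": "married", "industry": "health care"}
public_i = {"id": "u6", "gender": "F", "relationship": "married", "industry": None}

targets = inference_targets(truth_i, public_i)
print(targets)
```

Here `targets` holds only the hidden industry value, which is what the adversary is trying to infer.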
IV. ATTRIBUTE INFERENCE METHODOLOGY
Even though social networking sites often include privacy settings that allow a user to control which attributes in her profile are disclosed to the public, based on the previous literature presented in Section II, we observe that removing sensitive attributes from a public profile is insufficient to ensure that those attributes are not easily discoverable. In this paper, we are interested in understanding how a public attribute or public attribute combination can be used to infer hidden values. Therefore, we analyze how effective frequent patterns of a site's subpopulation are for inferring sensitive attributes that are hidden by a particular user.
We develop an attribute inference methodology for determining non-published attributes about a targeted user. Our methodology is based on the premise that an attacker may explore the site in question and then use this background knowledge to make inferences about a particular user's non-published attributes. Figure 1 shows the three steps in our high-level attribute inference methodology: subpopulation sampling, inference engine construction, and determination of hidden attribute-values.
A. Subpopulation Sampling
The first step of our methodology is to randomly sample a subpopulation of profiles from a site containing a database D. More formally, our subpopulation D0 has a representative sample of the attribute-value pairs of interest from D (i.e., D0 ⊆ D). In practice, an adversary can trivially obtain a subpopulation sample by using a site's web interface or API.
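A minimal sketch of the sampling step, assuming the adversary has already collected public profiles into a list. The names `sample_subpopulation`, `D`, and `D0`, and the generated profiles, are illustrative only; a real crawl through a site's web interface or API is not shown.

```python
import random
from typing import Dict, List

Profile = Dict[str, str]

def sample_subpopulation(site_profiles: List[Profile], k: int, seed: int = 0) -> List[Profile]:
    """Draw k distinct profiles uniformly at random, without replacement."""
    rng = random.Random(seed)  # fixed seed only so the sketch is repeatable
    return rng.sample(site_profiles, k)

# Hypothetical stand-in for the site's database D of public profiles.
D = [{"id": str(i), "gender": "F" if i % 2 else "M"} for i in range(1, 101)]
D0 = sample_subpopulation(D, k=10)
```

Uniform sampling without replacement is one simple choice; whether the sample is representative of the attribute-value pairs of interest still depends on how the profiles were collected.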
B. Site-based Inference Engine Construction and Determination of Hidden Attribute-Values
There are many methods for building a site-based inference engine. We begin by extending two previously proposed approaches: one that uses Latent Dirichlet Allocation (LDA) [4], and another based on a modified Naïve Bayes method [14]. We then propose a new approach based on multi-attribute association rule mining. Finally, we consider an ensemble approach that incorporates all of the different techniques into the site-level inference engine. Construction of the inference engine is done offline and infrequently for a particular site; therefore, the cost of generating inference models or rules is not significantly burdensome to the adversary.
To clarify the different methods, we will use a toy example based on user data presented in Figure 2. In this example, D contains four attributes: id, gender, relationship, and industry. The adversary is interested in determining User 6's industry attribute-value. In this scenario, User 6 has decided not to make this attribute-value public. Using the site API, the adversary obtains D0, a subset of D containing the public profiles for Users 1-5. The adversary will now generate his inference engine using these public profiles. The remainder of this subsection describes each of the methods that can serve as the basis for the inference engine that the adversary will build.
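The toy scenario can be sketched with a deliberately simplified rule miner. Figure 2 is not reproduced in this text, so the five profiles below are invented for illustration. The functions `mine_rules` and `infer` are hypothetical names, and this brute-force enumeration is a stand-in for, not a reproduction of, the multi-attribute association rule mining described in the paper.

```python
from collections import Counter
from itertools import combinations

def mine_rules(profiles, target, min_support=2):
    """Map each antecedent (set of public attribute-values) to (best target value, confidence, support)."""
    counts = {}  # antecedent -> Counter of target values seen with it
    for p in profiles:
        if p.get(target) is None:
            continue
        items = sorted((a, v) for a, v in p.items()
                       if a not in ("id", target) and v is not None)
        for r in range(1, len(items) + 1):
            for combo in combinations(items, r):
                counts.setdefault(frozenset(combo), Counter())[p[target]] += 1
    rules = {}
    for antecedent, ctr in counts.items():
        support = sum(ctr.values())  # number of sampled profiles matching the antecedent
        if support >= min_support:
            value, hits = ctr.most_common(1)[0]
            rules[antecedent] = (value, hits / support, support)
    return rules

def infer(rules, public_profile, target):
    """Apply the most confident (then most specific) rule that matches the public profile."""
    items = {(a, v) for a, v in public_profile.items()
             if a not in ("id", target) and v is not None}
    applicable = [(conf, len(ant), sup, val)
                  for ant, (val, conf, sup) in rules.items() if ant <= items]
    return max(applicable)[3] if applicable else None

# Invented public profiles for Users 1-5 (not the data from Figure 2).
D0 = [
    {"id": 1, "gender": "F", "relationship": "married", "industry": "health care"},
    {"id": 2, "gender": "F", "relationship": "married", "industry": "health care"},
    {"id": 3, "gender": "M", "relationship": "single", "industry": "software"},
    {"id": 4, "gender": "M", "relationship": "married", "industry": "software"},
    {"id": 5, "gender": "F", "relationship": "single", "industry": "education"},
]
user6 = {"id": 6, "gender": "F", "relationship": "married", "industry": None}

rules = mine_rules(D0, target="industry")
prediction = infer(rules, user6, target="industry")
print(prediction)
```

With these invented profiles, the two-attribute rule (gender F, relationship married) matches User 6 with confidence 1.0, so the sketch predicts "health care" even though User 6 never published that attribute.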
LDA Nearest Neighbor Inference: Chaabane et al. use the Latent Dirichlet Allocation (LDA) generative model to extract semantic links between interest attribute-values [3]. Our variation of their method is as follows.
Each profile $\{id_k, v^k_1, v^k_2, \ldots, v^k_q\}$ in the subpopulation D0 consists of the attribute-value pairs for some subset of attributes in D. We begin by considering a particular attribute $A_q$. Each attribute has a domain containing a set of values, $|A_q| = \{v_1, \ldots, v_m\}$. In LDA terms, we consider each attribute-value a word. For each attribute-value $v_k$, we obtain its related Wikipedia categories to enhance the value sets. We first retrieve the top relevant article describing the attribute $v_k$ from a free text index built by the Lemur Search Engine² over the entire Wikipedia stub contained in the ClueWeb09 collection³. The index's size is approximately 1GB for the compressed documents. Next, from each of these articles, we use Wikipedia as an ontology and obtain all the categories and general categories for the top n articles using a toolkit developed by [10]. For instance, a value "someone like you" has top Wikipedia categories "Adele (singer) songs" and "Singles certified septuple platinum by the Australian Recording Industry Association". These categories help to create the hidden topical structure that will be inferred using the observed attribute-values. Intuitively, the distribution of
² http://www.lemurproject.org
³ http://lemurproject.org/clueweb09
• Use user profiles to construct an inference engine containing a set of inference rules
• Make inferences using the inference engine
What can be inferred from the population?
[Chart: inference gain, axis from 0 to 15]
LinkedIn dataset: 91,150 public profiles, 12 attributes per profile
[Moore et al. 2013]
Web Footprinting
Experiments for Understanding Public Profiles
About.me: a personal website hosting site. Each user can make a custom webpage about themselves and can list links to their social media profiles on multiple websites.
Using their API, we collected 124,497 people's information -> ground truth
Creating Web Footprints Using Google+, Foursquare, LinkedIn Profiles
[Singh et al. 2015]
Synonyms can be found
DBpedia
Synonyms
Meronym
Using an Ontology
Approximately 8,000 attributes were matched up from the ontology
Taking Control of Our Web Identity and Data
1. Keep your public profile professional.
2. Change all your social media account settings that have personal information on them from public to private.
3. Choose your friends wisely: add them selectively.
4. Join groups related to your professional interests.
5. Make it difficult for automated tools to link your accounts, e.g., use different account user names, share different information, etc.
6. Install ad blockers to reduce data about your click-through habits.
7. Set your browser to not accept cookies from sites that you have not visited before.
The world around us
DATAFICATION
Data Ethics
• Regulation
  - We need to hold companies to higher standards
• Data ethics standards
  - We need discussion, debate, and possibly a new discipline
• Catalog of personal data
  - Individuals should be able to see, correct, and/or remove data companies have about them
[Singh 2016]
Final Thoughts
• There is a cultural acceptance of sharing private data publicly.
• This is a problem. I have shown you different techniques for generating web footprints. It is too easy.
• We need new ways to help users understand what data can be determined about them and help them take control of their information.
• We need to pause and debate online privacy and ethical uses of large-scale human behavioral data.
• We need to develop guidelines and regulations that protect users.
We need to take back control of our data
References
J. Zhu, S. Zhang, L. Singh, H. Yang, and M. Sherr. Generating Risk Reduction Recommendations to Decrease Identifiability of Public Online Profiles. Under submission.
A. Hian-Cheong, L. Singh, M. Sherr, and H. Yang. Semantics and Public Information Exposure Detection. Invited.
L. Singh, H. Yang, M. Sherr, A. Hian-Cheong, K. Tian, J. Zhu, and S. Zhang. Public Information Exposure Detection: Helping Users Understand Their Web Footprints. International Conference on Advances in Social Networks Analysis and Mining (ASONAM), Paris, France, IEEE/ACM, 2015.
L. Singh, H. Yang, M. Sherr, Y. Wei, A. Hian-Cheong, K. Tian, J. Zhu, S. Zhang, T. Vaidya, and E. Asgarli. Helping Users Understand Their Web Footprints. International Conference on World Wide Web - Companion Proceedings, World Wide Web (WWW), Florence, Italy, Poster Paper, 2015.
W. B. Moore, Y. Wei, A. Orshefsky, M. Sherr, L. Singh, and H. Yang. Understanding Site-Based Inference Potential for Identifying Hidden Attributes. International Conference on Privacy, Security, Risk and Trust, Alexandria, VA, IEEE Computer Society, 2013.
J. Ferro, L. Singh, and M. Sherr. Identifying individual vulnerability based on public data. International Conference on Privacy, Security and Trust, Tarragona, Catalonia, Spain, IEEE Computer Society, 2013.
F. Nagle, L. Singh, and A. Gkoulalas-Divanis. EWNI: Efficient Anonymization of Vulnerable Individuals in Social Networks. Pacific Asian Conference on Knowledge Discovery and Data Mining (PAKDD), Kuala Lumpur, Malaysia, Springer, 2012.
A. Ramachandran, L. Singh, E. Porter, and F. Nagle. Exploring re-identification risks in public domains. Conference on Privacy, Security and Trust (PST), IEEE Computer Society, 2012.
The Team & Support
• Faculty - Lisa Singh, Micah Sherr, Grace Hui Yang
• Students & Researchers - Rob Churchill, Kristen Skillman, Kevin Tian, Sicong Zhang, Yanan Zhu
• Alumni - Aditi Ramachandran, Frank Nagle, John Ferro, Yifang Wei, Brad Moore, Andrew Hian-Cheong, Janet Zhu
Support: National Science Foundation
5 Reasons to Join Our Program
1. We are research active and provide full financial RA support for PhD students for 5 years.
2. We have full and partial scholarships for Master's students.
3. Our courses span applied and theoretical areas of computer science as well as interdisciplinary areas like data science.
4. We have exceptional job placement in top tech firms, national labs, and government agencies.
5. We have a strong community among students and faculty.
Need more information?
Website: http://cs.georgetown.edu
Graduate Director: Lisa Singh (singh@cs.georgetown.edu)
Application deadline: April 1
Linking data
Facebook
First Name: Sally; Last Name: Smith; Gender: Female; Location: Georgetown; Hometown: Pittsburgh; Favorite Sports Team: Seahawks; Religion: Atheist
Google+
First Name: Sally; Last Name: Smith; Gender: Female; Location: Georgetown; Occupation: Dentist; Relationship Status: Married; Zip code: 22033
Adversary's Beliefs
First Name: Sally; Last Name: Smith; Gender: Female; Location: Georgetown; Hometown: Pittsburgh; Occupation: Dentist; Favorite Sports Team: Redskins; Religion: Atheist; Relationship Status: Married; Zip Code: 22033
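The linking step on this slide can be sketched as follows, assuming each public profile is a simple attribute dictionary. The function names `shared_agreement` and `merge_profiles` are hypothetical, and a plain union of attributes is only one possible way an adversary might combine profiles.

```python
def shared_agreement(a, b):
    """Return (number of shared attributes with equal values, number of shared attributes)."""
    shared = set(a) & set(b)
    agree = {k for k in shared if a[k] == b[k]}
    return len(agree), len(shared)

def merge_profiles(a, b):
    """Union of two profiles; values from the first profile win on conflicts."""
    merged = dict(a)
    for key, value in b.items():
        merged.setdefault(key, value)
    return merged

facebook = {
    "First Name": "Sally", "Last Name": "Smith", "Gender": "Female",
    "Location": "Georgetown", "Hometown": "Pittsburgh",
    "Favorite Sports Team": "Seahawks", "Religion": "Atheist",
}
google_plus = {
    "First Name": "Sally", "Last Name": "Smith", "Gender": "Female",
    "Location": "Georgetown", "Occupation": "Dentist",
    "Relationship Status": "Married", "Zip code": "22033",
}

agree, shared = shared_agreement(facebook, google_plus)
web_footprint = merge_profiles(facebook, google_plus)
print(agree, shared, len(web_footprint))
```

All four shared attributes agree, so the two accounts are plausibly the same person and the merged footprint has ten attributes. A plain union keeps the Facebook value for Favorite Sports Team; the belief listed on the slide differs, and this sketch does not model how the adversary arrived at it.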
What about friends?
Site 1: starting user, list of names of friends
Site 2: list of names of friends for a given user
match = number of overlapping friends between users
[Ramachandran et al. 2012]
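The friend-overlap score can be sketched as a set intersection over normalized names. The function names, the lowercase normalization, and the sample friend lists are assumptions made for illustration; the cited work may compute the match differently.

```python
def friend_overlap(friends_a, friends_b):
    """Count names that appear in both friend lists, ignoring case and surrounding spaces."""
    def normalize(names):
        return {name.strip().lower() for name in names}
    return len(normalize(friends_a) & normalize(friends_b))

def best_match(start_friends, candidates):
    """Return the candidate account on the second site with the most overlapping friends."""
    return max(candidates, key=lambda user: friend_overlap(start_friends, candidates[user]))

# Hypothetical friend lists: one starting user on site 1, two candidate accounts on site 2.
site1_friends = ["Ann Lee", "Bob Cruz", "Cara Diaz", "Dev Patel"]
site2_candidates = {
    "jsmith_a": ["ann lee", "Bob Cruz ", "Eli Stone"],
    "jsmith_b": ["Fay Wong", "Dev Patel"],
}

match_a = friend_overlap(site1_friends, site2_candidates["jsmith_a"])
winner = best_match(site1_friends, site2_candidates)
print(match_a, winner)
```

In this example the first candidate shares two friends with the starting user and the second shares one, so the first is the more likely match.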
Web Footprint
[Diagram: three John Doe profiles with attributes {A1, A2}, {A3, A4}, and {A5, A6} combine into a single web footprint {A1, A2, A3, A4, A5, A6}]
Really linking data
Shared Public Attributes
Google+: Company, Occupation, Education, Location, Birthdate, Relationship status, Gender, Graduation year
LinkedIn: Company, Location, Education, Email, Occupation, Skills, Industry, Website, Languages
Foursquare: Facebook id, Twitter handle, Email, Gender, Location, Phone number, Relationship status
What do group memberships tell us?
What about tweets?
• A special wish for a special girl HappyBirthday
• I love Starbuck MangoTeaLemonade
• Go Bears
[Singh et al. 2015]
• Birthday • Gender • Address • Education • Hobbies
• Skills • Title • Industry • Education • Experience
• Thoughts • Ideas • Interests • Hobbies
What about the population?
To what degree can site-level data be leveraged to determine the undisclosed attributes of a user?
Methodology
• Sample user profiles from media sites
• Make inferences using the inference engine
• Use user profiles to construct an inference engine containing a set of inference rules
What can be inferred from the population?
[Figure: chart of inference gain, axis from 0 to 15] [Moore et al. 2013]
LinkedIn dataset: 91,150 public profiles, 12 attributes per profile
Web Footprinting
Experiments for Understanding Public Profiles
About.me: a personal website hosting site. Each user can make a custom webpage about themselves and can list links to their social media profiles on multiple websites.
Using their API, we collected 124,497 people's information -> Ground Truth
Creating Web Footprints Using Google+, Foursquare, LinkedIn Profiles
[Singh et al. 2015]
Synonyms can be found
DBpedia
Synonyms
Meronym
Using an Ontology
Approximately 8,000 attributes were matched up from the ontology
Taking Control of Our Web Identity and Data
1. Keep your public profile professional.
2. Change all your social media account settings that have personal information on them from public to private.
3. Choose your friends wisely: add them selectively.
4. Join groups related to your professional interests.
5. Make it difficult for automated tools to link your accounts, e.g., use different account user names, share different information, etc.
6. Install ad blockers to reduce data about your click-through habits.
7. Set your browser to not accept cookies from sites that you have not visited before.
The world around us
DATAFICATION
Data Ethics
• Regulation
  - We need to hold companies to higher standards
• Data ethics standards
  - We need discussion, debate, and possibly a new discipline
• Catalog of personal data
  - Individuals should be able to see, correct, and/or remove data companies have about them
[Singh 2016]
Final Thoughts
• There is a cultural acceptance of sharing private data publicly.
• This is a problem. I have shown you different techniques for generating web footprints. It is too easy.
• We need new ways to help users understand what data can be determined about them and help them take control of their information.
• We need to pause and debate online privacy and ethical uses of large-scale human behavioral data.
• We need to develop guidelines and regulations that protect users.
We need to take back control of our data.
References
J. Zhu, S. Zhang, L. Singh, H. Yang, and M. Sherr. Generating Risk Reduction Recommendations to Decrease Identifiability of Public Online Profiles. Under submission.
A. Hian-Cheong, L. Singh, M. Sherr, H. Yang. Semantics and Public Information Exposure Detection. Invited.
L. Singh, H. Yang, M. Sherr, A. Hian-Cheong, K. Tian, J. Zhu, and S. Zhang. Public Information Exposure Detection: Helping Users Understand Their Web Footprints. International Conference on Advances in Social Networks Analysis and Mining (ASONAM), Paris, France, IEEE/ACM, 2015.
L. Singh, H. Yang, M. Sherr, Y. Wei, A. Hian-Cheong, K. Tian, J. Zhu, S. Zhang, T. Vaidya, and E. Asgarli. Helping Users Understand Their Web Footprints. International Conference on World Wide Web - Companion Proceedings, World Wide Web (WWW), Florence, Italy, Poster Paper, 2015.
W. B. Moore, Y. Wei, A. Orshefsky, M. Sherr, L. Singh, H. Yang. Understanding Site-Based Inference Potential for Identifying Hidden Attributes. International Conference on Privacy, Security, Risk and Trust, Alexandria, VA, IEEE Computer Society, 2013.
J. Ferro, L. Singh, M. Sherr. Identifying individual vulnerability based on public data. International Conference on Privacy, Security and Trust, Tarragona, Catalonia, Spain, IEEE Computer Society, 2013.
F. Nagle, L. Singh, and A. Gkoulalas-Divanis. EWNI: Efficient Anonymization of Vulnerable Individuals in Social Networks. Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), Kuala Lumpur, Malaysia, Springer, 2012.
A. Ramachandran, L. Singh, E. Porter, and F. Nagle. Exploring re-identification risks in public domains. Conference on Privacy, Security and Trust (PST), IEEE Computer Society, 2012.
The Team & Support
• Faculty: Lisa Singh, Micah Sherr, Grace Hui Yang
• Students & Researchers: Rob Churchill, Kristen Skillman, Kevin Tian, Sicong Zhang, Yanan Zhu
• Alumni: Aditi Ramachandran, Frank Nagle, John Ferro, Yifang Wei, Brad Moore, Andrew Hian-Cheong, Janet Zhu
Support: National Science Foundation
5 Reasons to Join Our Program
1. We are research active and provide full financial RA support for PhD students for 5 years.
2. We have full and partial scholarships for Master's students.
3. Our courses span applied and theoretical areas of computer science, as well as interdisciplinary areas like data science.
4. We have exceptional job placement in top tech firms, national labs, and government agencies.
5. We have a strong community among students and faculty.
Need more information? Website: http://cs.georgetown.edu Graduate Director: Lisa Singh (singh@cs.georgetown.edu)
Application deadline: April 1
Linking data

Facebook
First Name: Sally
Last Name: Smith
Gender: Female
Location: Georgetown
Hometown: Pittsburgh
Favorite Sports Team: Seahawks
Religion: Atheist

Google+
First Name: Sally
Last Name: Smith
Gender: Female
Location: Georgetown
Occupation: Dentist
Relationship Status: Married
Zip Code: 22033

Adversary's Beliefs
First Name: Sally
Last Name: Smith
Gender: Female
Location: Georgetown
Hometown: Pittsburgh
Occupation: Dentist
Favorite Sports Team: Redskins
Religion: Atheist
Relationship Status: Married
Zip Code: 22033
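The linking idea on this slide can be sketched in Python as a merge of two public profiles that agree on the attributes they share. This is a minimal illustration, not the method used in the cited work: the function name `link_profiles` and the choice of key attributes are assumptions. It carries over published values unchanged, so it does not model cases where the adversary's belief differs from a published value, as with the sports team above.

```python
def link_profiles(profile_a, profile_b, key_attrs):
    """Merge two profiles if both publish every key attribute with equal values.

    Returns the combined attribute dictionary, or None when the profiles
    cannot be linked on the given keys.
    """
    for attr in key_attrs:
        if attr not in profile_a or attr not in profile_b:
            return None
        if profile_a[attr] != profile_b[attr]:
            return None
    merged = dict(profile_a)
    for attr, value in profile_b.items():
        merged.setdefault(attr, value)  # keep profile_a's value on overlap
    return merged


facebook = {
    "First Name": "Sally", "Last Name": "Smith", "Gender": "Female",
    "Location": "Georgetown", "Hometown": "Pittsburgh",
    "Favorite Sports Team": "Seahawks", "Religion": "Atheist",
}
google_plus = {
    "First Name": "Sally", "Last Name": "Smith", "Gender": "Female",
    "Location": "Georgetown", "Occupation": "Dentist",
    "Relationship Status": "Married", "Zip Code": "22033",
}

keys = ("First Name", "Last Name", "Gender", "Location")
footprint = link_profiles(facebook, google_plus, keys)
```

Neither site alone reveals both religion and zip code, but the merged footprint contains all ten attributes.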
What about friends?
[Diagram comparing site 1 and site 2: starting user; list of names of friends; list of names of friends for given user]
match = number of overlapping friends between users
[Ramachandran et al. 2012]
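The match score on this slide is a count of shared friend names, which can be sketched in Python as a set intersection. The name normalization, the `best_match` helper, and all names in the example are assumptions made for illustration and are not taken from the cited paper.

```python
def normalize(name):
    """Lowercase a name and collapse whitespace so trivial variants compare equal."""
    return " ".join(name.lower().split())


def friend_overlap(friends_a, friends_b):
    """match = number of overlapping friend names between two users."""
    set_a = {normalize(n) for n in friends_a}
    set_b = {normalize(n) for n in friends_b}
    return len(set_a & set_b)


def best_match(starting_friends, candidates):
    """Return the candidate user whose friend list overlaps most with the starting user's."""
    return max(candidates, key=lambda user: friend_overlap(starting_friends, candidates[user]))


# Hypothetical friend lists: one starting user on site 1, two candidates on site 2.
site1_friends = ["Ann Lee", "Bob Ray", "Cy Dee"]
site2_candidates = {
    "jsmith_a": ["bob  ray", "ann lee", "Zed Orr"],
    "jsmith_b": ["Zed Orr", "Kim Poe"],
}

score = friend_overlap(site1_friends, site2_candidates["jsmith_a"])
winner = best_match(site1_friends, site2_candidates)
```

A higher overlap suggests the two accounts belong to the same person even when the profile attributes themselves differ.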
Web Footprint
[Diagram: three John Doe profiles with attributes A1 A2, A3 A4, and A5 A6 combine into one web footprint with attributes A1 A2 A3 A4 A5 A6]
Really linking data
Shared Public Attributes

Google+
• Company
• Occupation
• Education
• Location
• Birthdate
• Relationship status
• Gender
• Graduation Year

• Company
• Location
• Education
• Email
• Occupation
• Skills
• Industry
• Website
• Languages

FourSquare
• Facebook id
• Twitter handle
• Email
• Gender
• Location
• Phone number
• Relationship status
What do group memberships tell us?
What about tweets?
• A special wish for a special girl HappyBirthday
• I love Starbuck MangoTeaLemonade
• Go Bears
[Singh et al. 2015]

• Birthday
• Gender
• Address
• Education
• Hobbies

• Skills
• Title
• Industry
• Education
• Experience

• Thoughts
• Ideas
• Interests
• Hobbies

To what degree can site-level data be leveraged to determine the undisclosed attributes of a user?
What about the population?
Methodology
• Sample user profiles from media sites
[Figure 1 diagram: Public Profiles -> Step 1: Subpopulation Sampling -> Step 2: Inference Engine Construction (inference model and example rules ab → c, acd → e, ad → b, bcdf → a) -> Step 3: Determination of Hidden Attribute-Values (user profile and inference engine yield hidden attribute-values)]
Fig. 1. Attribute inference methodology. The adversary samples a random subpopulation of site-level public profiles (left), constructs site-level inference rules and models using the sampled public profiles (center), and applies the inference engine to a targeted user's public profile to predict a hidden attribute-value (right).
the value of attribute $A_j$ when restricted$(i, A_j)$ = ). The adversary's goal is to discover information about $i$ which is not in the public profile $P^i$.

To more formally define the attacker, we introduce a truth function truth$(i)$ that returns a set $\{id, v_{i1}, v_{i2}, \ldots, v_{im}\}$ such that (1) $id = P^i_{ID}$ and (2) either $v_{ij} = P^i_j$ if $P^i_j$ = or otherwise $v_{ij}$ is the correct value of attribute $A_j$ for user $i$. Intuitively, truth$(i)$ is the complete set of correct values for user $i$ for attributes $ID, A_1, \ldots, A_m$ and can contain values that are not in $D$. Hence, the adversary's goal is to infer the set truth$(i) \setminus P^i$. Note that this includes both attributes that are restricted using the site's permission system as well as attributes that are unknown (i.e., null) to the site. Toward that end, in this paper, we attempt to infer single attribute-values using data available to the adversary.

IV. ATTRIBUTE INFERENCE METHODOLOGY

Even though social networking sites often include privacy settings that allow a user to control which attributes in her profile are disclosed to the public, based on the previous literature presented in Section II, we make the observation that removing sensitive attributes from a public profile is insufficient to ensure that those attributes are not easily discoverable. In this paper, we are interested in understanding how a public attribute or public attribute combination can be used to infer hidden values. Therefore, we analyze how effective frequent patterns of a site's subpopulation are for inferring sensitive attributes that are hidden by a particular user.
Consequences of Over-sharing
bull Identity theft bull Online and physical stalking bull Blackmailing bull Negative employment consequences
bull Enabling of snoopers
Data Privacy Expectations
bull We should expect data privacy
bull We should expect freedom from unauthorized use of our data
bull We should expectfreedom from data intrusion
How informative linkable or sensitive is your public profile ndash your web footprint
Gay
Georgetown University
Washington DC
Software Developer
John Smith
John Smith
Divorced
Spanish-speaking
Department of Defense
Republican
Catholic
Your name
Lisa Singh Micah Sherr
Linking dataFacebook
First Name SallyLast Name SmithGender FemaleLocation GeorgetownHometown PittsburghFavorite Sports Team SeahawksReligion Atheist
Google+
First Name SallyLast Name SmithGender FemaleLocation Georgetown Occupation DentistRelationship Status MarriedZip code 22033
Adversary's Beliefs
• First Name: Sally
• Last Name: Smith
• Gender: Female
• Location: Georgetown
• Hometown: Pittsburgh
• Occupation: Dentist
• Favorite Sports Team: Redskins
• Religion: Atheist
• Relationship Status: Married
• Zip Code: 22033
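The linking idea on these slides can be sketched in a few lines of Python. This is only an illustration, not the method of the cited work: the function names and the rule "link two profiles when every attribute they share has the same value" are assumptions made for the example, and only a subset of the slide's attributes is used.

```python
def shared_attributes_agree(a, b):
    """True when every attribute present in both profiles has the same value."""
    return all(a[k] == b[k] for k in a.keys() & b.keys())


def merge_profiles(a, b):
    """Return the union of two profiles, or None if a shared attribute conflicts."""
    if not shared_attributes_agree(a, b):
        return None
    return {**a, **b}


# Attribute values taken from the slides; the key names are hypothetical.
facebook = {
    "first_name": "Sally", "last_name": "Smith", "gender": "Female",
    "location": "Georgetown", "hometown": "Pittsburgh", "religion": "Atheist",
}
google_plus = {
    "first_name": "Sally", "last_name": "Smith", "gender": "Female",
    "location": "Georgetown", "occupation": "Dentist",
    "relationship_status": "Married", "zip_code": "22033",
}

beliefs = merge_profiles(facebook, google_plus)
for attribute in sorted(beliefs):
    print(attribute, "=", beliefs[attribute])
```

Neither site alone reveals both the hometown and the occupation; the merged dictionary does, which is the point of the slide.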
What about friends?
• Site 1: starting user and list of names of friends
• Site 2: list of names of friends for a given user
• match = number of overlapping friends between users
[Ramachandran et al. 2012]
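A minimal sketch of the friend-overlap score, assuming friend lists are available as plain lists of names. The name normalization, the function names, and the sample data are hypothetical and are not taken from [Ramachandran et al. 2012].

```python
def normalize(name):
    """Lower-case a name and collapse whitespace so trivial variants compare equal."""
    return " ".join(name.lower().split())


def friend_overlap(friends_a, friends_b):
    """match = number of overlapping friend names between two users."""
    return len({normalize(n) for n in friends_a} & {normalize(n) for n in friends_b})


def best_match(start_friends, candidates):
    """Pick the site-2 candidate whose friend list overlaps most with the starting user.

    candidates maps a candidate account id to that account's list of friend names.
    """
    return max(candidates, key=lambda c: friend_overlap(start_friends, candidates[c]))


# Hypothetical data: one starting user on site 1, two candidate accounts on site 2.
site1_friends = ["Ann Lee", "Bob Ray", "Cy Day"]
site2_candidates = {
    "jsmith_01": ["ann lee", "Dee Fox"],
    "jsmith_02": ["Ann  Lee", "bob ray", "Eve Poe"],
}
print(best_match(site1_friends, site2_candidates))
```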
Web Footprint
[Figure: three John Doe profiles with attributes {A1, A2}, {A3, A4}, and {A5, A6} combine into one web footprint {A1, A2, A3, A4, A5, A6}]
Really linking data
Shared Public Attributes
Google+
• Company, Occupation, Education, Location, Birthdate, Relationship status
• Gender, Graduation Year
• Company, Location, Education, Email, Occupation, Skills, Industry, Website, Languages
FourSquare
• Facebook id, Twitter handle
• Email, Gender, Location, Phone number
• Relationship status
What do group memberships tell us?
What about tweets?
• A special wish for a special girl HappyBirthday
• I love Starbuck MangoTeaLemonade
• Go Bears
[Singh et al. 2015]
• Birthday, Gender, Address, Education, Hobbies
• Skills, Title, Industry, Education, Experience
• Thoughts, Ideas, Interests, Hobbies
To what degree can site-level data be leveraged to determine the undisclosed attributes of a user?
What about the population?
Methodology
• Sample user profiles from media sites
[Figure panels: Public Profiles; Step 1: Subpopulation Sampling; Step 2: Inference Engine Construction (Inference Model, with example rules ab → c, acd → e, ad → b, bcdf → a); Step 3: Determination of Hidden Attribute-Values (User Profile, Inference Engine, Hidden Attribute-Values)]
Fig. 1. Attribute inference methodology. The adversary samples a random subpopulation of site-level public profiles (left), constructs site-level inference rules and models using the sampled public profiles (center), and applies the inference engine to a targeted user's public profile to predict a hidden attribute-value (right).
[…] the value of attribute $A_j$ when restricted$(i, A_j) =$ […]). The adversary's goal is to discover information about $i$ which is not in the public profile $P^i$.
To more formally define the attacker, we introduce a truth function truth$(i)$ that returns a set $\{id, v_{i1}, v_{i2}, \ldots, v_{im}\}$ such that (1) $id = P^i_{ID}$ and (2) either $v_{ij} = P^i_j$ if $P^i_j =$ […] or otherwise $v_{ij}$ is the correct value of attribute $A_j$ for user $i$. Intuitively, truth$(i)$ is the complete set of correct values for user $i$ for attributes $ID, A_1, \ldots, A_m$, and can contain values that are not in $D$. Hence, the adversary's goal is to infer the set truth$(i) \setminus P^i$. Note that this includes both attributes that are restricted using the site's permission system as well as attributes that are unknown (i.e., null) to the site. Toward that end, in this paper we attempt to infer single attribute-values using data available to the adversary.
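As a small illustration of the adversary's target set, the sketch below treats a profile as a dictionary and computes the attributes that are in the truth but not visible in the public profile. The dictionary representation, the use of `None` for a restricted or unknown value, and all sample values are assumptions made for this example.

```python
def adversary_target(truth, public_profile):
    """Attribute-values in truth(i) that the public profile does not disclose.

    An attribute counts as undisclosed when it is missing from the public
    profile or present with the placeholder value None.
    """
    return {
        attr: value
        for attr, value in truth.items()
        if public_profile.get(attr) is None
    }


# Hypothetical user: the industry is restricted, the zip code is unknown to the site.
truth = {"id": 6, "gender": "F", "relationship": "Married",
         "industry": "Health", "zip_code": "22033"}
public_profile = {"id": 6, "gender": "F", "relationship": "Married",
                  "industry": None}

print(adversary_target(truth, public_profile))
```

Both kinds of hidden attribute named in the text show up in the result: the one restricted through privacy settings and the one the site never stored.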
IV. ATTRIBUTE INFERENCE METHODOLOGY
Even though social networking sites often include privacy settings that allow a user to control which attributes in her profile are disclosed to the public, based on the previous literature presented in Section II, we make the observation that removing sensitive attributes from a public profile is insufficient to ensure that those attributes are not easily discoverable. In this paper, we are interested in understanding how a public attribute or public attribute combination can be used to infer hidden values. Therefore, we analyze how effective frequent patterns of a site's subpopulation are for inferring sensitive attributes that are hidden by a particular user.
We develop an attribute inference methodology for determining non-published attributes about a targeted user. Our methodology is based on the premise that an attacker may explore the site in question and then use this background knowledge to make inferences about a particular user's non-published attributes. Figure 1 shows the three steps in our high-level attribute inference methodology: subpopulation sampling, inference engine construction, and determination of hidden attribute-values.
A. Subpopulation Sampling
The first step of our methodology is to randomly sample a subpopulation of profiles from a site containing a database $D$. More formally, our subpopulation $D'$ has a representative sample of the attribute-value pairs of interest from $D$ (i.e., $D' \subset D$). In practice, an adversary can trivially obtain a subpopulation sample by using a site's web interface or API.
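A minimal sketch of the sampling step, assuming the public profiles have already been collected into a Python list. The function name, the seed parameter, and the generated profiles are hypothetical; a real adversary would page through a site's web interface or API instead.

```python
import random


def sample_subpopulation(profiles, k, seed=None):
    """Draw k distinct profiles uniformly at random, without replacement."""
    rng = random.Random(seed)  # a local generator keeps the draw reproducible
    return rng.sample(profiles, k)


# Hypothetical database of public profiles.
database = [{"id": i, "gender": "F" if i % 2 else "M"} for i in range(1, 101)]
subpopulation = sample_subpopulation(database, k=10, seed=42)
print(len(subpopulation))
```

Uniform sampling without replacement is the simplest way to get a sample that is representative in expectation; whether a site's API actually allows uniform access is a separate question that this sketch does not address.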
B. Site-based Inference Engine Construction and Determination of Hidden Attribute-Values
There are many methods for building a site-based inference engine. We begin by extending two previously proposed approaches: one that uses Latent Dirichlet Allocation (LDA) [4] and another based on a modified Naïve Bayes method [14]. We then propose a new approach based on multi-attribute association rule mining. Finally, we consider an ensemble approach that incorporates all of the different techniques into the site-level inference engine. Construction of the inference engine is done offline and infrequently for a particular site; therefore, the cost of generating inference models or rules is not significantly burdensome to the adversary.
To clarify the different methods, we will use a toy example based on user data presented in Figure 2. In this example, $D$ contains four attributes: id, gender, relationship, and industry. The adversary is interested in determining User 6's industry attribute-value. In this scenario, User 6 has decided to not make this attribute-value public. Using the site API, the adversary obtains $D'$, a subset of $D$ containing the public profiles for Users 1-5. The adversary will now generate his inference engine using these public profiles. The remainder of this subsection describes each of the methods that can serve as the basis for the inference engine that the adversary will build.
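The toy scenario can be made concrete with a small rule-based sketch. Figure 2 is not reproduced here, so the five sampled profiles below are invented for illustration, and the code is a brute-force enumeration of attribute combinations rather than the paper's association rule mining algorithm. The names `mine_rules`, `infer`, and `min_count` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def mine_rules(profiles, target, min_count=2):
    """Map each antecedent (tuple of attribute-value pairs) to (value, confidence)."""
    antecedent_counts = Counter()
    joint_counts = Counter()
    for p in profiles:
        items = sorted((a, v) for a, v in p.items() if a not in ("id", target))
        for r in range(1, len(items) + 1):
            for antecedent in combinations(items, r):
                antecedent_counts[antecedent] += 1
                joint_counts[(antecedent, p[target])] += 1
    rules = {}
    for (antecedent, value), n in joint_counts.items():
        if antecedent_counts[antecedent] < min_count:
            continue  # too few sampled profiles to trust this pattern
        confidence = n / antecedent_counts[antecedent]
        if antecedent not in rules or confidence > rules[antecedent][1]:
            rules[antecedent] = (value, confidence)
    return rules


def infer(rules, public_profile, target):
    """Predict the hidden value using the most confident, most specific matching rule."""
    items = {(a, v) for a, v in public_profile.items() if a not in ("id", target)}
    matching = [
        (confidence, len(antecedent), value)
        for antecedent, (value, confidence) in rules.items()
        if set(antecedent) <= items
    ]
    return max(matching)[2] if matching else None


# Invented stand-in for Users 1-5; not the data in the paper's Figure 2.
sampled = [
    {"id": 1, "gender": "F", "relationship": "Married", "industry": "Health"},
    {"id": 2, "gender": "F", "relationship": "Married", "industry": "Health"},
    {"id": 3, "gender": "M", "relationship": "Single", "industry": "Finance"},
    {"id": 4, "gender": "F", "relationship": "Single", "industry": "Finance"},
    {"id": 5, "gender": "M", "relationship": "Married", "industry": "Health"},
]
user6 = {"id": 6, "gender": "F", "relationship": "Married"}

rules = mine_rules(sampled, target="industry")
print(infer(rules, user6, target="industry"))
```

Enumerating every combination is exponential in the number of attributes, which is acceptable for three or four attributes but is exactly what dedicated frequent-pattern algorithms are designed to avoid.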
LDA Nearest Neighbor Inference: Chaabane et al. use the Latent Dirichlet Allocation (LDA) generative model to extract semantic links between interest attribute-values [3]. Our variation of their method is as follows.
Each profile $\{id_k, v_{k1}, v_{k2}, \ldots, v_{kq}\}$ in the subpopulation $D'$ consists of the attribute-value pairs for some subset of attributes in $D$. We begin by considering a particular attribute $A_q$. Each attribute has a domain containing a set of values $|A_q| = \{v_1, \ldots, v_m\}$. In LDA terms, we consider each attribute-value a word. For each attribute-value $v_k$, we obtain its related Wikipedia categories to enhance the value sets. We first retrieve the top relevant article describing the attribute $v_k$ from a free text index built by the Lemur Search Engine² over the entire Wikipedia stub contained in the ClueWeb09 collection³. The index's size is approximately 1GB for the compressed documents. Next, from each of these articles, we use Wikipedia as an ontology and obtain all the categories and general categories for the top $n$ articles using a toolkit developed by [10]. For instance, a value "someone like you" has top Wikipedia categories "Adele (singer) songs" and "Singles certified septuple platinum by the Australian Recording Industry Association". These categories help to create the hidden topical structure that will be inferred using the observed attribute-values. Intuitively, the distribution of […]
² http://www.lemurproject.org
³ http://lemurproject.org/clueweb09
• Use user profiles to construct an inference engine containing a set of inference rules
• Make inferences using the inference engine
What can be inferred from the population?
[Figure: inference gain chart; axis: Inference gain, 0 to 15]
LinkedIn dataset: 91,150 public profiles, 12 attributes per profile
[Moore et al. 2013]
Web Footprinting
Experiments for Understanding Public Profiles
About.me: a personal website hosting site. Each user can make a custom webpage about themselves and can list links to their social media profiles on multiple websites.
Using their API, we collected 124,497 people's information → ground truth
Creating Web Footprints Using Google+, Foursquare, and LinkedIn Profiles
[Singh et al. 2015]
Synonyms can be found
[Figure labels: DBpedia, Synonyms, Meronym]
Using an Ontology
Approximately 8,000 attributes were matched up from the ontology
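One way to picture the ontology step is as a lookup that maps differently named attributes onto a shared canonical name before profiles are compared. The sketch below uses a tiny hand-written synonym table as a stand-in for an ontology such as DBpedia; the table contents and the function names are hypothetical and do not describe the system in [Singh et al. 2015].

```python
# Hypothetical synonym table standing in for an ontology lookup.
SYNONYMS = {
    "job": "occupation",
    "profession": "occupation",
    "city": "location",
    "town": "location",
    "marital status": "relationship status",
}


def canonical(attribute):
    """Map an attribute name to its canonical form, if a synonym is known."""
    key = " ".join(attribute.lower().split())
    return SYNONYMS.get(key, key)


def align(profile):
    """Rename a profile's attributes so profiles from different sites are comparable."""
    return {canonical(name): value for name, value in profile.items()}


site_a = {"Job": "Dentist", "City": "Georgetown"}
site_b = {"occupation": "Dentist", "Marital  Status": "Married"}
print(align(site_a))
print(align(site_b))
```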
Taking Control of Our Web Identity and Data
1. Keep your public profile professional
2. Change all your social media account settings that have personal information on them from public to private
3. Choose your friends wisely: add them selectively
4. Join groups related to your professional interests
5. Make it difficult for automated tools to link your accounts, e.g., use different account user names, share different information, etc.
6. Install ad blockers to reduce data about your click-through habits
7. Set your browser to not accept cookies from sites that you have not visited before
The world around us
DATAFICATION
Data Ethics
• Regulation
  – We need to hold companies to higher standards
• Data ethics standards
  – We need discussion, debate, and possibly a new discipline
• Catalog of personal data
  – Individuals should be able to see, correct, and/or remove data companies have about them
[Singh 2016]
Final Thoughts
• There is a cultural acceptance of sharing private data publicly
• This is a problem. I have shown you different techniques for generating web footprints. It is too easy
• We need new ways to help users understand what data can be determined about them and help them take control of their information
• We need to pause and debate online privacy and ethical uses of large-scale human behavioral data
• We need to develop guidelines and regulations that protect users
We need to take back control of our data
References
J. Zhu, S. Zhang, L. Singh, H. Yang, and M. Sherr. Generating Risk Reduction Recommendations to Decrease Identifiability of Public Online Profiles. Under submission.
A. Hian-Cheong, L. Singh, M. Sherr, and H. Yang. Semantics and Public Information Exposure Detection. Invited.
L. Singh, H. Yang, M. Sherr, A. Hian-Cheong, K. Tian, J. Zhu, and S. Zhang. Public Information Exposure Detection: Helping Users Understand Their Web Footprints. International Conference on Advances in Social Networks Analysis and Mining (ASONAM), Paris, France, IEEE/ACM, 2015.
L. Singh, H. Yang, M. Sherr, Y. Wei, A. Hian-Cheong, K. Tian, J. Zhu, S. Zhang, T. Vaidya, and E. Asgarli. Helping Users Understand Their Web Footprints. International Conference on World Wide Web - Companion Proceedings, World Wide Web (WWW), Florence, Italy, Poster Paper, 2015.
W. B. Moore, Y. Wei, A. Orshefsky, M. Sherr, L. Singh, and H. Yang. Understanding Site-Based Inference Potential for Identifying Hidden Attributes. International Conference on Privacy, Security, Risk and Trust, Alexandria, VA, IEEE Computer Society, 2013.
J. Ferro, L. Singh, and M. Sherr. Identifying individual vulnerability based on public data. International Conference on Privacy, Security and Trust, Tarragona, Catalonia, Spain, IEEE Computer Society, 2013.
F. Nagle, L. Singh, and A. Gkoulalas-Divanis. EWNI: Efficient Anonymization of Vulnerable Individuals in Social Networks. Pacific Asian Conference on Knowledge Discovery and Data Mining (PAKDD), Kuala Lumpur, Malaysia, Springer, 2012.
A. Ramachandran, L. Singh, E. Porter, and F. Nagle. Exploring re-identification risks in public domains. Conference on Privacy, Security and Trust (PST), IEEE Computer Society, 2012.
Data Privacy Expectations
bull We should expect data privacy
bull We should expect freedom from unauthorized use of our data
bull We should expectfreedom from data intrusion
How informative linkable or sensitive is your public profile ndash your web footprint
Gay
Georgetown University
Washington DC
Software Developer
John Smith
John Smith
Divorced
Spanish-speaking
Department of Defense
Republican
Catholic
Your name
Lisa Singh Micah Sherr
Linking dataFacebook
First Name SallyLast Name SmithGender FemaleLocation GeorgetownHometown PittsburghFavorite Sports Team SeahawksReligion Atheist
Google+
First Name SallyLast Name SmithGender FemaleLocation Georgetown Occupation DentistRelationship Status MarriedZip code 22033
Linking dataFacebook
First Name SallyLast Name SmithGender FemaleLocation GeorgetownHometown PittsburghFavorite Sports Team SeahawksReligion Atheist
Google+
First Name SallyLast Name SmithGender FemaleLocation Georgetown Occupation DentistRelationship Status MarriedZip code 22033
Adversaryrsquos Beliefs
First Name SallyLast Name SmithGender FemaleLocation GeorgetownHometown PittsburghOccupation DentistFavorite Sports Team RedskinsReligion AtheistRelationship Status MarriedZip Code 22033
What about friends
Startinguser
Listofnamesoffriends
Listofnamesoffriendsforgiven
user
match =numberoverlappingfriendsbetweenusers
site1 site2
[Ramachandran etal2012]
WebFootprint
A1A2A3A4A5A6
A1A2
JohnDoe
A3A4
JohnDoe
A5A6
JohnDoe
Really linking data
Shared Public Attributes
Google+
bull Companybull Occupationbull Educationbull Locationbull Birthdatebull Relationshipstatus
bull Genderbull GraduationYear
bull Companybull Locationbull Educationbull Emailbull Occupationbull Skillsbull Industrybull Websitebull Languages
FourSquare
bull Facebookidbull Twitterhandle
bull Emailbull Genderbull Locationbull Phonenumber
bull Relationshipstatus
What do group memberships tell us
What about tweets
bull A special wish for a special girl HappyBirthdaybull I love Starbuck MangoTeaLemonadebull Go Bears
[Singhetal2015]
bull Birthdaybull Genderbull Addressbull Educationbull Hobbies
bull Skillsbull Titlebull Industrybull Educationbull Experience
bull Thoughtsbull Ideasbull Interestsbull HobbiesTowhatdegreecansiteleveldatabe
leveragedtodeterminetheundisclosedattributesofauser
What about the population
Methodology
bull Sample user profiles from media sites
ab rarr cacd rarr ead rarr b
bcdf rarr a
Public Profiles
Step 1Subpopulation
Sampling
Step 2Inference Engine
Construction
UserProfile
InferenceModel
HiddenAttribute-Values
Step 3Determination of Hidden
Attribute-Values
InferenceEngine
InferenceModel
Fig 1 Attribute inference methodology The adversary samples arandom subpopulation of site-level public profiles (left) constructssite-level inference rules and models using the sampled public profiles(center) and applies the inference engine to a targeted userrsquos publicprofile to predict a hidden attribute-value (right)
the value of attribute Aj when restricted(i Aj) = ) Theadversaryrsquos goal is to discover information about i which isnot in the public profile P i
To more formally define the attacker we introduce a truthfunction truth(i) that returns a set id vi1 vi2 vim suchthat (1) id = Pi
ID and (2) either vij = Pij if Pi
j = orotherwise vij is the correct value of attribute Aj for user iIntuitively truth(i) is the complete set of correct values foruser i for attributes ID A1 Am and can contain valuesthat are not in D Hence the adversaryrsquos goal is to infer theset truth(i) P i Note that this includes both attributes thatare restricted using the sitersquos permission system as well asattributes that are unknown (ie null) to the site Toward thatend in this paper we attempt to infer single attribute-valuesusing data available to the adversary
IV ATTRIBUTE INFERENCE METHODOLOGY
Even though social networking sites often include privacysettings that allow a user to control which attributes in her pro-file are disclosed to the public based on the previous literaturepresented in Section II we make the observation that removingsensitive attributes from a public profile is insufficient to ensurethat those attributes are not easily discoverable In this paperwe are interested in understanding how a public attribute orpublic attribute combination can be used to infer hidden valuesTherefore we analyze how effective frequent patterns of asitersquos subpopulation are for inferring sensitive attributes thatare hidden by a particular user
We develop an attribute inference methodology for deter-mining non-published attributes about a targeted user Ourmethodology is based on the premise that an attacker mayexplore the site in question and then use this backgroundknowledge to make inferences about a particular userrsquos non-published attributes Figure 1 shows the three steps in our highlevel attribute inference methodology subpopulation samplinginference engine construction and determination of hiddenattribute-values
A Subpopulation Sampling
The first step of our methodology is to randomly samplea subpopulation of profiles from a site containing a databaseD More formally our subpopulation D0 has a representativesample of the attribute-value pairs of interest from D (ieD0 D) In practice an adversary can trivially obtain asubpopulation sample by using a sitersquos web interface or API
B Site-based Inference Engine Construction and Determina-tion of Hidden Attribute-Values
There are many methods for building a site-based inferenceengine We begin by extending two previously proposed ap-proaches one that uses Latent Dirichlet Allocation (LDA) [4]and another based on a modified Naıve Bayes method [14]We then propose a new approach based on multi-attributeassociation rule mining Finally we consider an ensembleapproach that incorporates all of the different techniques intothe site-level inference engine Construction of the inferenceengine is done offline and infrequently for a particular sitetherefore the cost of generating inference models or rules isnot significantly burdensome to the adversary
To clarify the different methods we will use a toy examplebased on user data presented in Figure 2 In this example Dcontains four attributes id gender relationship and industryThe adversary is interested in determining User 6rsquos industryattribute-value In this scenario User 6 has decided to not makethis attribute-value public Using the site API the adversaryobtains D0 a subset of D containing the public profiles forUsers 1-5 The adversary will now generate his inferenceengine using these public profiles The remainder of thissubsection describes each of the methods that can serve as thebasis for the inference engine that the adversary will build
LDA Nearest Neighbor Inference Chaabane et al usethe Latent Dirichlet Allocation (LDA) generative model toextract semantic links between interest attribute-values [3] Ourvariation of their method is as follows
Each profile idk vk1 vk2 vkq in the subpopulation D0
consists of the attribute-value pairs for some subset of at-tributes in D We begin by considering a particular attributeAq Each attribute has a domain containing a set of values|Aq| = v1 vm In LDA terms we consider eachattribute-value a word For each attribute-value vk we obtainits related Wikipedia categories to enhance the value sets Wefirst retrieve the top relevant article describing the attributevk from a free text index built by the Lemur Search Engine2
over the entire Wikipedia stub contained in the ClueWeb09collection3 The indexrsquos size is approximately 1GB for thecompressed documents Next from each of these articleswe use Wikipedia as an ontology and obtain all the cate-gories and general categories for the top n articles using atoolkit developed by [10] For instance a value ldquosomeonelike yourdquo has top Wikipedia categories ldquoAdele (singer) songsrdquoand ldquoSingles certified septuple platinum by the AustralianRecording Industry Associationrdquo These categories help tocreate the hidden topical structure that will be inferred usingthe observed attribute-values Intuitively the distribution of
2httpwwwlemurprojectorg3httplemurprojectorgclueweb09
ab rarr cacd rarr ead rarr b
bcdf rarr a
Public Profiles
Step 1Subpopulation
Sampling
Step 2Inference Engine
Construction
UserProfile
InferenceModel
HiddenAttribute-Values
Step 3Determination of Hidden
Attribute-Values
InferenceEngine
InferenceModel
Fig 1 Attribute inference methodology The adversary samples arandom subpopulation of site-level public profiles (left) constructssite-level inference rules and models using the sampled public profiles(center) and applies the inference engine to a targeted userrsquos publicprofile to predict a hidden attribute-value (right)
[…] the value of attribute $A_j$ when restricted$(i, A_j) = $ […]). The adversary's goal is to discover information about $i$ which is not in the public profile $P^i$.
To define the attacker more formally, we introduce a truth function truth$(i)$ that returns a set $\{id, v_{i1}, v_{i2}, \ldots, v_{im}\}$ such that (1) $id = P^i_{ID}$ and (2) either $v_{ij} = P^i_j$ if attribute $A_j$ has a value in the public profile $P^i$, or otherwise $v_{ij}$ is the correct value of attribute $A_j$ for user $i$. Intuitively, truth$(i)$ is the complete set of correct values for user $i$ for attributes $ID, A_1, \ldots, A_m$, and can contain values that are not in $D$. Hence, the adversary's goal is to infer the set truth$(i) \setminus P^i$. Note that this includes both attributes that are restricted using the site's permission system and attributes that are unknown (i.e., null) to the site. Toward that end, in this paper we attempt to infer single attribute-values using data available to the adversary.
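The adversary's goal described above can be read as a set difference between the complete profile and the public one. The following is a minimal sketch of that idea, assuming profiles are Python dictionaries in which a missing key stands for a restricted or unknown attribute. The function name, attribute names, and values are hypothetical and are not taken from the paper.

```python
def hidden_attributes(truth, public):
    """Return the attribute-values present in truth(i) but absent from the public profile."""
    return {attr: value for attr, value in truth.items() if attr not in public}


# Hypothetical user: truth_i holds every correct value, public_i only the disclosed ones.
truth_i = {"id": 6, "gender": "F", "relationship": "married", "industry": "software"}
public_i = {"id": 6, "gender": "F", "relationship": "married"}

# What the adversary would like to infer for this user.
target = hidden_attributes(truth_i, public_i)
print(target)
```

In practice the adversary never sees `truth_i`; the sketch only makes explicit which values count as a successful inference.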
IV. ATTRIBUTE INFERENCE METHODOLOGY
Even though social networking sites often include privacy settings that allow a user to control which attributes in her profile are disclosed to the public, based on the previous literature presented in Section II, we make the observation that removing sensitive attributes from a public profile is insufficient to ensure that those attributes are not easily discoverable. In this paper, we are interested in understanding how a public attribute or public attribute combination can be used to infer hidden values. Therefore, we analyze how effective frequent patterns of a site's subpopulation are for inferring sensitive attributes that are hidden by a particular user.
We develop an attribute inference methodology for determining non-published attributes about a targeted user. Our methodology is based on the premise that an attacker may explore the site in question and then use this background knowledge to make inferences about a particular user's non-published attributes. Figure 1 shows the three steps in our high-level attribute inference methodology: subpopulation sampling, inference engine construction, and determination of hidden attribute-values.
A. Subpopulation Sampling
The first step of our methodology is to randomly sample a subpopulation of profiles from a site containing a database $D$. More formally, our subpopulation $D'$ has a representative sample of the attribute-value pairs of interest from $D$ (i.e., $D' \subseteq D$). In practice, an adversary can trivially obtain a subpopulation sample by using a site's web interface or API.
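The sampling step can be illustrated with a uniform random draw without replacement. This is only a sketch under stated assumptions: the site database is modeled as an in-memory list of dictionaries, and the function name, the profile fields, and the sample size are hypothetical. A real adversary would page through a web interface or API instead.

```python
import random


def sample_subpopulation(profiles, k, seed=None):
    """Draw k distinct profiles uniformly at random from the site's public profiles."""
    rng = random.Random(seed)  # a seeded generator keeps the sketch reproducible
    return rng.sample(profiles, k)


# Hypothetical site database of 100 public profiles.
site_profiles = [{"id": n, "gender": "F" if n % 2 else "M"} for n in range(100)]

subpopulation = sample_subpopulation(site_profiles, k=10, seed=0)
print(len(subpopulation))
```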
B. Site-based Inference Engine Construction and Determination of Hidden Attribute-Values
There are many methods for building a site-based inference engine. We begin by extending two previously proposed approaches: one that uses Latent Dirichlet Allocation (LDA) [4] and another based on a modified Naïve Bayes method [14]. We then propose a new approach based on multi-attribute association rule mining. Finally, we consider an ensemble approach that incorporates all of the different techniques into the site-level inference engine. Construction of the inference engine is done offline and infrequently for a particular site; therefore, the cost of generating inference models or rules is not significantly burdensome to the adversary.
To clarify the different methods, we will use a toy example based on user data presented in Figure 2. In this example, $D$ contains four attributes: id, gender, relationship, and industry. The adversary is interested in determining User 6's industry attribute-value. In this scenario, User 6 has decided to not make this attribute-value public. Using the site API, the adversary obtains $D'$, a subset of $D$ containing the public profiles for Users 1-5. The adversary will now generate his inference engine using these public profiles. The remainder of this subsection describes each of the methods that can serve as the basis for the inference engine that the adversary will build.
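To make the toy scenario concrete, the sketch below predicts a hidden industry value from the sampled profiles that agree with the target on every public attribute. It is a simple frequency-based rule in the spirit of association rule mining, not the paper's exact method. The data for Users 1-5 is a hypothetical stand-in, since Figure 2 is not reproduced here, and the function name is likewise an assumption.

```python
from collections import Counter


def infer_hidden(sample, target_public, hidden_attr):
    """Predict hidden_attr from sampled profiles matching the target's public values.

    Returns (predicted value, confidence), or (None, 0.0) when no profile matches.
    """
    matches = [
        p[hidden_attr]
        for p in sample
        if hidden_attr in p
        and all(p.get(attr) == value for attr, value in target_public.items())
    ]
    if not matches:
        return None, 0.0
    value, count = Counter(matches).most_common(1)[0]
    return value, count / len(matches)


# Hypothetical public profiles for Users 1-5.
sample = [
    {"id": 1, "gender": "F", "relationship": "single", "industry": "finance"},
    {"id": 2, "gender": "M", "relationship": "married", "industry": "software"},
    {"id": 3, "gender": "F", "relationship": "married", "industry": "education"},
    {"id": 4, "gender": "M", "relationship": "married", "industry": "software"},
    {"id": 5, "gender": "M", "relationship": "single", "industry": "finance"},
]

# User 6 hides industry; the id is left out because it is unique to each user.
user6_public = {"gender": "M", "relationship": "married"}

prediction, confidence = infer_hidden(sample, user6_public, "industry")
print(prediction, confidence)
```

Here the confidence is the fraction of matching profiles that share the predicted value, which plays the role of rule confidence for the rule {gender, relationship} → industry.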
LDA Nearest Neighbor Inference. Chaabane et al. use the Latent Dirichlet Allocation (LDA) generative model to extract semantic links between interest attribute-values [3]. Our variation of their method is as follows.
Each profile $\{id_k, v_{k1}, v_{k2}, \ldots, v_{kq}\}$ in the subpopulation $D'$ consists of the attribute-value pairs for some subset of attributes in $D$. We begin by considering a particular attribute $A_q$. Each attribute has a domain containing a set of values, $|A_q| = \{v_1, \ldots, v_m\}$. In LDA terms, we consider each attribute-value a word. For each attribute-value $v_k$, we obtain its related Wikipedia categories to enhance the value sets. We first retrieve the top relevant article describing the attribute $v_k$ from a free text index built by the Lemur Search Engine (footnote 2) over the entire Wikipedia stub contained in the ClueWeb09 collection (footnote 3). The index's size is approximately 1GB for the compressed documents. Next, from each of these articles, we use Wikipedia as an ontology and obtain all the categories and general categories for the top n articles using a toolkit developed by [10]. For instance, a value "someone like you" has top Wikipedia categories "Adele (singer) songs" and "Singles certified septuple platinum by the Australian Recording Industry Association". These categories help to create the hidden topical structure that will be inferred using the observed attribute-values. Intuitively, the distribution of […]
Footnote 2: http://www.lemurproject.org
Footnote 3: http://lemurproject.org/clueweb09
• Use user profiles to construct an inference engine containing a set of inference rules
• Make inferences using the inference engine
What can be inferred from the population?
[Chart: inference gain, axis ticks 0 to 15]
[Moore et al. 2013]
LinkedIn dataset: 91,150 public profiles, 12 attributes per profile
Web Footprinting
Experiments for Understanding Public Profiles
About.me - personal website hosting site. Each user can make a custom webpage about themselves and can list links to their social media profiles on multiple websites.
Using their API, we collected information on 124,497 people -> ground truth
Creating Web Footprints Using Google+, Foursquare, and LinkedIn Profiles
[Singh et al. 2015]
Synonyms can be found
DBpedia
Synonyms
Meronym
Using an Ontology
Approximately 8,000 attributes were matched up from the ontology
Taking Control of Our Web Identity and Data
1. Keep your public profile professional
2. Change all your social media account settings that have personal information on them from public to private
3. Choose your friends wisely - add them selectively
4. Join groups related to your professional interests
5. Make it difficult for automated tools to link your accounts, e.g., use different account user names, share different information, etc.
6. Install ad blockers to reduce data about your click-through habits
7. Set your browser to not accept cookies from sites that you have not visited before
The world around us
DATAFICATION
Data Ethics
• Regulation
  - We need to hold companies to higher standards
• Data ethics standards
  - We need discussion, debate, and possibly a new discipline
• Catalog of personal data
  - Individuals should be able to see, correct, and/or remove data companies have about them
[Singh 2016]
Final Thoughts
• There is a cultural acceptance of sharing private data publicly
• This is a problem - I have shown you different techniques for generating web footprints. It is too easy.
• We need new ways to help users understand what data can be determined about them and help them take control of their information
• We need to pause and debate online privacy and ethical uses of large-scale human behavioral data
• We need to develop guidelines and regulations that protect users
We need to take back control of our data
References
J. Zhu, S. Zhang, L. Singh, H. Yang, and M. Sherr. Generating Risk Reduction Recommendations to Decrease Identifiability of Public Online Profiles. Under submission.
A. Hian-Cheong, L. Singh, M. Sherr, and H. Yang. Semantics and Public Information Exposure Detection. Invited.
L. Singh, H. Yang, M. Sherr, A. Hian-Cheong, K. Tian, J. Zhu, and S. Zhang. Public Information Exposure Detection: Helping Users Understand Their Web Footprints. International Conference on Advances in Social Networks Analysis and Mining (ASONAM), Paris, France. IEEE/ACM, 2015.
L. Singh, H. Yang, M. Sherr, Y. Wei, A. Hian-Cheong, K. Tian, J. Zhu, S. Zhang, T. Vaidya, and E. Asgarli. Helping Users Understand Their Web Footprints. International Conference on World Wide Web - Companion Proceedings, World Wide Web (WWW), Florence, Italy. Poster paper, 2015.
W. B. Moore, Y. Wei, A. Orshefsky, M. Sherr, L. Singh, and H. Yang. Understanding Site-Based Inference Potential for Identifying Hidden Attributes. International Conference on Privacy, Security, Risk and Trust, Alexandria, VA. IEEE Computer Society, 2013.
J. Ferro, L. Singh, and M. Sherr. Identifying individual vulnerability based on public data. International Conference on Privacy, Security and Trust, Tarragona, Catalonia, Spain. IEEE Computer Society, 2013.
F. Nagle, L. Singh, and A. Gkoulalas-Divanis. EWNI: Efficient Anonymization of Vulnerable Individuals in Social Networks. Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), Kuala Lumpur, Malaysia. Springer, 2012.
A. Ramachandran, L. Singh, E. Porter, and F. Nagle. Exploring re-identification risks in public domains. Conference on Privacy, Security and Trust (PST). IEEE Computer Society, 2012.
The Team & Support
• Faculty - Lisa Singh, Micah Sherr, Grace Hui Yang
• Students & Researchers - Rob Churchill, Kristen Skillman, Kevin Tian, Sicong Zhang, Yanan Zhu
• Alumni - Aditi Ramachandran, Frank Nagle, John Ferro, Yifang Wei, Brad Moore, Andrew Hian-Cheong, Janet Zhu
Support: National Science Foundation
5 Reasons to Join Our Program
1. We are research active and provide full financial RA support for PhD students for 5 years
2. We have full and partial scholarships for Master's students
3. Our courses span applied and theoretical areas of computer science as well as interdisciplinary areas like data science
4. We have exceptional job placement in top tech firms, national labs, and government agencies
5. We have a strong community among students and faculty
Need more information?
Website: http://cs.georgetown.edu
Graduate Director: Lisa Singh (singh@cs.georgetown.edu)
Application deadline: April 1
Linking data
Facebook: First Name: Sally; Last Name: Smith; Gender: Female; Location: Georgetown; Hometown: Pittsburgh; Favorite Sports Team: Seahawks; Religion: Atheist
Google+: First Name: Sally; Last Name: Smith; Gender: Female; Location: Georgetown; Occupation: Dentist; Relationship Status: Married; Zip code: 22033
Adversary's Beliefs: First Name: Sally; Last Name: Smith; Gender: Female; Location: Georgetown; Hometown: Pittsburgh; Occupation: Dentist; Favorite Sports Team: Redskins; Religion: Atheist; Relationship Status: Married; Zip Code: 22033
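The linking step shown on this slide can be sketched as a merge of two profiles that agree on a set of shared attributes. This is a minimal illustration, assuming each profile is a Python dictionary; the function name and the choice of linking keys are assumptions made for the example, not a method stated in the talk.

```python
def link_profiles(profile_a, profile_b,
                  keys=("First Name", "Last Name", "Gender", "Location")):
    """Merge two profiles when every linking key is present and equal; else return None."""
    for key in keys:
        if key not in profile_a or profile_a[key] != profile_b.get(key):
            return None
    merged = dict(profile_a)
    for attr, value in profile_b.items():
        merged.setdefault(attr, value)  # on conflict, keep the first site's value
    return merged


facebook = {
    "First Name": "Sally", "Last Name": "Smith", "Gender": "Female",
    "Location": "Georgetown", "Hometown": "Pittsburgh",
    "Favorite Sports Team": "Seahawks", "Religion": "Atheist",
}
google_plus = {
    "First Name": "Sally", "Last Name": "Smith", "Gender": "Female",
    "Location": "Georgetown", "Occupation": "Dentist",
    "Relationship Status": "Married", "Zip code": "22033",
}

merged = link_profiles(facebook, google_plus)
print(sorted(merged))
```

The result is only a naive union of what the two sites disclose. An adversary's actual beliefs may also contain inferred or mistaken values, so they need not equal this union.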
What about friends?
Starting user (site 1) → list of names of friends
Given user (site 2) → list of names of friends for given user
match = number of overlapping friends between users
[Ramachandran et al. 2012]
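The friend-overlap match can be sketched as a set intersection over normalized friend names. The account names, friend lists, and the name normalization below are hypothetical; this is an illustration of the overlap count on the slide, not the published implementation.

```python
def normalize(name):
    """Lowercase a name and collapse repeated whitespace so small variations still match."""
    return " ".join(name.lower().split())


def friend_overlap(friends_a, friends_b):
    """Count the names that appear in both friend lists."""
    return len({normalize(n) for n in friends_a} & {normalize(n) for n in friends_b})


def best_match(start_friends, candidates):
    """Return the candidate account whose friend list overlaps most with the starting user."""
    return max(candidates, key=lambda user: friend_overlap(start_friends, candidates[user]))


# Hypothetical friend list of the starting user on site 1.
site1_friends = ["Ann Lee", "Bob Cruz", "Cara Diaz", "Dev Patel"]

# Hypothetical candidate accounts on site 2 with their friend lists.
site2_candidates = {
    "jdoe_42": ["ann lee", "Bob Cruz", "Dev  Patel"],
    "john.doe": ["Eve Kim", "Bob Cruz"],
}

print(best_match(site1_friends, site2_candidates))
```

A real matcher would likely need a minimum-overlap threshold, since a candidate with only one shared friend is weak evidence that two accounts belong to the same person.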
Web Footprint: A1 A2 A3 A4 A5 A6
[Figure: three John Doe profiles exposing (A1, A2), (A3, A4), and (A5, A6) are combined into a single web footprint]
Really linking data
Shared Public Attributes
Google+
• Company • Occupation • Education • Location • Birthdate • Relationship status
• Gender • Graduation Year
• Company • Location • Education • Email • Occupation • Skills • Industry • Website • Languages
FourSquare
• Facebook id • Twitter handle
• Email • Gender • Location • Phone number
• Relationship status
What do group memberships tell us?
What about tweets?
• A special wish for a special girl HappyBirthday
• I love Starbuck MangoTeaLemonade
• Go Bears
[Singh et al. 2015]
• Birthday • Gender • Address • Education • Hobbies
• Skills • Title • Industry • Education • Experience
• Thoughts • Ideas • Interests • Hobbies
What about the population?
To what degree can site-level data be leveraged to determine the undisclosed attributes of a user?
Methodology
• Sample user profiles from media sites
ab rarr cacd rarr ead rarr b
bcdf rarr a
Public Profiles
Step 1Subpopulation
Sampling
Step 2Inference Engine
Construction
UserProfile
InferenceModel
HiddenAttribute-Values
Step 3Determination of Hidden
Attribute-Values
InferenceEngine
InferenceModel
Fig 1 Attribute inference methodology The adversary samples arandom subpopulation of site-level public profiles (left) constructssite-level inference rules and models using the sampled public profiles(center) and applies the inference engine to a targeted userrsquos publicprofile to predict a hidden attribute-value (right)
the value of attribute Aj when restricted(i Aj) = ) Theadversaryrsquos goal is to discover information about i which isnot in the public profile P i
To more formally define the attacker we introduce a truthfunction truth(i) that returns a set id vi1 vi2 vim suchthat (1) id = Pi
ID and (2) either vij = Pij if Pi
j = orotherwise vij is the correct value of attribute Aj for user iIntuitively truth(i) is the complete set of correct values foruser i for attributes ID A1 Am and can contain valuesthat are not in D Hence the adversaryrsquos goal is to infer theset truth(i) P i Note that this includes both attributes thatare restricted using the sitersquos permission system as well asattributes that are unknown (ie null) to the site Toward thatend in this paper we attempt to infer single attribute-valuesusing data available to the adversary
IV ATTRIBUTE INFERENCE METHODOLOGY
Even though social networking sites often include privacysettings that allow a user to control which attributes in her pro-file are disclosed to the public based on the previous literaturepresented in Section II we make the observation that removingsensitive attributes from a public profile is insufficient to ensurethat those attributes are not easily discoverable In this paperwe are interested in understanding how a public attribute orpublic attribute combination can be used to infer hidden valuesTherefore we analyze how effective frequent patterns of asitersquos subpopulation are for inferring sensitive attributes thatare hidden by a particular user
We develop an attribute inference methodology for deter-mining non-published attributes about a targeted user Ourmethodology is based on the premise that an attacker mayexplore the site in question and then use this backgroundknowledge to make inferences about a particular userrsquos non-published attributes Figure 1 shows the three steps in our highlevel attribute inference methodology subpopulation samplinginference engine construction and determination of hiddenattribute-values
A Subpopulation Sampling
The first step of our methodology is to randomly samplea subpopulation of profiles from a site containing a databaseD More formally our subpopulation D0 has a representativesample of the attribute-value pairs of interest from D (ieD0 D) In practice an adversary can trivially obtain asubpopulation sample by using a sitersquos web interface or API
B Site-based Inference Engine Construction and Determina-tion of Hidden Attribute-Values
There are many methods for building a site-based inferenceengine We begin by extending two previously proposed ap-proaches one that uses Latent Dirichlet Allocation (LDA) [4]and another based on a modified Naıve Bayes method [14]We then propose a new approach based on multi-attributeassociation rule mining Finally we consider an ensembleapproach that incorporates all of the different techniques intothe site-level inference engine Construction of the inferenceengine is done offline and infrequently for a particular sitetherefore the cost of generating inference models or rules isnot significantly burdensome to the adversary
To clarify the different methods we will use a toy examplebased on user data presented in Figure 2 In this example Dcontains four attributes id gender relationship and industryThe adversary is interested in determining User 6rsquos industryattribute-value In this scenario User 6 has decided to not makethis attribute-value public Using the site API the adversaryobtains D0 a subset of D containing the public profiles forUsers 1-5 The adversary will now generate his inferenceengine using these public profiles The remainder of thissubsection describes each of the methods that can serve as thebasis for the inference engine that the adversary will build
LDA Nearest Neighbor Inference Chaabane et al usethe Latent Dirichlet Allocation (LDA) generative model toextract semantic links between interest attribute-values [3] Ourvariation of their method is as follows
Each profile idk vk1 vk2 vkq in the subpopulation D0
consists of the attribute-value pairs for some subset of at-tributes in D We begin by considering a particular attributeAq Each attribute has a domain containing a set of values|Aq| = v1 vm In LDA terms we consider eachattribute-value a word For each attribute-value vk we obtainits related Wikipedia categories to enhance the value sets Wefirst retrieve the top relevant article describing the attributevk from a free text index built by the Lemur Search Engine2
over the entire Wikipedia stub contained in the ClueWeb09collection3 The indexrsquos size is approximately 1GB for thecompressed documents Next from each of these articleswe use Wikipedia as an ontology and obtain all the cate-gories and general categories for the top n articles using atoolkit developed by [10] For instance a value ldquosomeonelike yourdquo has top Wikipedia categories ldquoAdele (singer) songsrdquoand ldquoSingles certified septuple platinum by the AustralianRecording Industry Associationrdquo These categories help tocreate the hidden topical structure that will be inferred usingthe observed attribute-values Intuitively the distribution of
2httpwwwlemurprojectorg3httplemurprojectorgclueweb09
ab rarr cacd rarr ead rarr b
bcdf rarr a
Public Profiles
Step 1Subpopulation
Sampling
Step 2Inference Engine
Construction
UserProfile
InferenceModel
HiddenAttribute-Values
Step 3Determination of Hidden
Attribute-Values
InferenceEngine
InferenceModel
Fig 1 Attribute inference methodology The adversary samples arandom subpopulation of site-level public profiles (left) constructssite-level inference rules and models using the sampled public profiles(center) and applies the inference engine to a targeted userrsquos publicprofile to predict a hidden attribute-value (right)
the value of attribute Aj when restricted(i Aj) = ) Theadversaryrsquos goal is to discover information about i which isnot in the public profile P i
To more formally define the attacker we introduce a truthfunction truth(i) that returns a set id vi1 vi2 vim suchthat (1) id = Pi
ID and (2) either vij = Pij if Pi
j = orotherwise vij is the correct value of attribute Aj for user iIntuitively truth(i) is the complete set of correct values foruser i for attributes ID A1 Am and can contain valuesthat are not in D Hence the adversaryrsquos goal is to infer theset truth(i) P i Note that this includes both attributes thatare restricted using the sitersquos permission system as well asattributes that are unknown (ie null) to the site Toward thatend in this paper we attempt to infer single attribute-valuesusing data available to the adversary
IV ATTRIBUTE INFERENCE METHODOLOGY
Even though social networking sites often include privacysettings that allow a user to control which attributes in her pro-file are disclosed to the public based on the previous literaturepresented in Section II we make the observation that removingsensitive attributes from a public profile is insufficient to ensurethat those attributes are not easily discoverable In this paperwe are interested in understanding how a public attribute orpublic attribute combination can be used to infer hidden valuesTherefore we analyze how effective frequent patterns of asitersquos subpopulation are for inferring sensitive attributes thatare hidden by a particular user
We develop an attribute inference methodology for deter-mining non-published attributes about a targeted user Ourmethodology is based on the premise that an attacker mayexplore the site in question and then use this backgroundknowledge to make inferences about a particular userrsquos non-published attributes Figure 1 shows the three steps in our highlevel attribute inference methodology subpopulation samplinginference engine construction and determination of hiddenattribute-values
A Subpopulation Sampling
The first step of our methodology is to randomly samplea subpopulation of profiles from a site containing a databaseD More formally our subpopulation D0 has a representativesample of the attribute-value pairs of interest from D (ieD0 D) In practice an adversary can trivially obtain asubpopulation sample by using a sitersquos web interface or API
B Site-based Inference Engine Construction and Determina-tion of Hidden Attribute-Values
There are many methods for building a site-based inferenceengine We begin by extending two previously proposed ap-proaches one that uses Latent Dirichlet Allocation (LDA) [4]and another based on a modified Naıve Bayes method [14]We then propose a new approach based on multi-attributeassociation rule mining Finally we consider an ensembleapproach that incorporates all of the different techniques intothe site-level inference engine Construction of the inferenceengine is done offline and infrequently for a particular sitetherefore the cost of generating inference models or rules isnot significantly burdensome to the adversary
To clarify the different methods we will use a toy examplebased on user data presented in Figure 2 In this example Dcontains four attributes id gender relationship and industryThe adversary is interested in determining User 6rsquos industryattribute-value In this scenario User 6 has decided to not makethis attribute-value public Using the site API the adversaryobtains D0 a subset of D containing the public profiles forUsers 1-5 The adversary will now generate his inferenceengine using these public profiles The remainder of thissubsection describes each of the methods that can serve as thebasis for the inference engine that the adversary will build
LDA Nearest Neighbor Inference Chaabane et al usethe Latent Dirichlet Allocation (LDA) generative model toextract semantic links between interest attribute-values [3] Ourvariation of their method is as follows
Each profile idk vk1 vk2 vkq in the subpopulation D0
consists of the attribute-value pairs for some subset of at-tributes in D We begin by considering a particular attributeAq Each attribute has a domain containing a set of values|Aq| = v1 vm In LDA terms we consider eachattribute-value a word For each attribute-value vk we obtainits related Wikipedia categories to enhance the value sets Wefirst retrieve the top relevant article describing the attributevk from a free text index built by the Lemur Search Engine2
over the entire Wikipedia stub contained in the ClueWeb09collection3 The indexrsquos size is approximately 1GB for thecompressed documents Next from each of these articleswe use Wikipedia as an ontology and obtain all the cate-gories and general categories for the top n articles using atoolkit developed by [10] For instance a value ldquosomeonelike yourdquo has top Wikipedia categories ldquoAdele (singer) songsrdquoand ldquoSingles certified septuple platinum by the AustralianRecording Industry Associationrdquo These categories help tocreate the hidden topical structure that will be inferred usingthe observed attribute-values Intuitively the distribution of
2httpwwwlemurprojectorg3httplemurprojectorgclueweb09
ab rarr cacd rarr ead rarr b
bcdf rarr a
Public Profiles
Step 1Subpopulation
Sampling
Step 2Inference Engine
Construction
UserProfile
InferenceModel
HiddenAttribute-Values
Step 3Determination of Hidden
Attribute-Values
InferenceEngine
InferenceModel
Fig 1 Attribute inference methodology The adversary samples arandom subpopulation of site-level public profiles (left) constructssite-level inference rules and models using the sampled public profiles(center) and applies the inference engine to a targeted userrsquos publicprofile to predict a hidden attribute-value (right)
the value of attribute Aj when restricted(i Aj) = ) Theadversaryrsquos goal is to discover information about i which isnot in the public profile P i
To more formally define the attacker we introduce a truthfunction truth(i) that returns a set id vi1 vi2 vim suchthat (1) id = Pi
ID and (2) either vij = Pij if Pi
j = orotherwise vij is the correct value of attribute Aj for user iIntuitively truth(i) is the complete set of correct values foruser i for attributes ID A1 Am and can contain valuesthat are not in D Hence the adversaryrsquos goal is to infer theset truth(i) P i Note that this includes both attributes thatare restricted using the sitersquos permission system as well asattributes that are unknown (ie null) to the site Toward thatend in this paper we attempt to infer single attribute-valuesusing data available to the adversary
IV ATTRIBUTE INFERENCE METHODOLOGY
Even though social networking sites often include privacysettings that allow a user to control which attributes in her pro-file are disclosed to the public based on the previous literaturepresented in Section II we make the observation that removingsensitive attributes from a public profile is insufficient to ensurethat those attributes are not easily discoverable In this paperwe are interested in understanding how a public attribute orpublic attribute combination can be used to infer hidden valuesTherefore we analyze how effective frequent patterns of asitersquos subpopulation are for inferring sensitive attributes thatare hidden by a particular user
We develop an attribute inference methodology for deter-mining non-published attributes about a targeted user Ourmethodology is based on the premise that an attacker mayexplore the site in question and then use this backgroundknowledge to make inferences about a particular userrsquos non-published attributes Figure 1 shows the three steps in our highlevel attribute inference methodology subpopulation samplinginference engine construction and determination of hiddenattribute-values
A. Subpopulation Sampling
The first step of our methodology is to randomly sample a subpopulation of profiles from a site containing a database $D$. More formally, our subpopulation $D'$ is a representative sample of the attribute-value pairs of interest from $D$ (i.e., $D' \subseteq D$). In practice, an adversary can trivially obtain a subpopulation sample by using a site's web interface or API.
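The sampling step can be sketched in a few lines of Python. This is a minimal illustration that assumes the adversary already holds a list of crawled public profiles; the function name `sample_subpopulation`, the profile fields, and the sample size are hypothetical, and uniform sampling without replacement is only one way to obtain $D'$.

```python
import random

def sample_subpopulation(profiles, k, seed=None):
    """Draw k profiles uniformly at random, without replacement, to form D'."""
    rng = random.Random(seed)  # seeded only to make the sketch repeatable
    return rng.sample(profiles, k)

# Hypothetical stand-in for the public profiles of a site's database D.
site_profiles = [{"id": n, "gender": "F" if n % 2 else "M"} for n in range(1, 101)]

d_prime = sample_subpopulation(site_profiles, k=10, seed=0)
print(len(d_prime))
```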
B. Site-based Inference Engine Construction and Determination of Hidden Attribute-Values
There are many methods for building a site-based inference engine. We begin by extending two previously proposed approaches: one that uses Latent Dirichlet Allocation (LDA) [4] and another based on a modified Naïve Bayes method [14]. We then propose a new approach based on multi-attribute association rule mining. Finally, we consider an ensemble approach that incorporates all of the different techniques into the site-level inference engine. Construction of the inference engine is done offline and infrequently for a particular site; therefore, the cost of generating inference models or rules is not significantly burdensome to the adversary.
To clarify the different methods, we will use a toy example based on the user data presented in Figure 2. In this example, $D$ contains four attributes: id, gender, relationship, and industry. The adversary is interested in determining User 6's industry attribute-value. In this scenario, User 6 has decided not to make this attribute-value public. Using the site API, the adversary obtains $D'$, a subset of $D$ containing the public profiles for Users 1-5. The adversary will now generate his inference engine using these public profiles. The remainder of this subsection describes each of the methods that can serve as the basis for the inference engine that the adversary will build.
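To make the toy scenario concrete, the sketch below runs a plain Naïve Bayes classifier with Laplace smoothing over five public profiles and predicts User 6's industry. It is not the modified Naïve Bayes method of [14]; the profile values are invented stand-ins for Figure 2, which is not reproduced here, and the function name and smoothing parameter `alpha` are assumptions of this example.

```python
from collections import Counter, defaultdict

# Hypothetical stand-in for the Figure 2 data (Users 1-5 form D').
d_prime = [
    {"id": 1, "gender": "F", "relationship": "married", "industry": "health"},
    {"id": 2, "gender": "M", "relationship": "single", "industry": "software"},
    {"id": 3, "gender": "F", "relationship": "married", "industry": "health"},
    {"id": 4, "gender": "M", "relationship": "married", "industry": "software"},
    {"id": 5, "gender": "F", "relationship": "single", "industry": "software"},
]
# User 6 publishes gender and relationship but hides industry.
user6_public = {"gender": "F", "relationship": "married"}

def naive_bayes_predict(profiles, target, evidence, alpha=1.0):
    """Score each target value by prior times smoothed per-attribute likelihoods."""
    class_counts = Counter(p[target] for p in profiles)
    value_counts = defaultdict(lambda: defaultdict(Counter))
    domains = defaultdict(set)
    for p in profiles:
        for attr in evidence:
            value_counts[attr][p[target]][p[attr]] += 1
            domains[attr].add(p[attr])
    scores = {}
    for cls, n_cls in class_counts.items():
        score = n_cls / len(profiles)  # prior P(target = cls)
        for attr, value in evidence.items():
            count = value_counts[attr][cls][value]
            # Laplace-smoothed estimate of P(attr = value | target = cls)
            score *= (count + alpha) / (n_cls + alpha * len(domains[attr]))
        scores[cls] = score
    return max(scores, key=scores.get), scores

predicted, scores = naive_bayes_predict(d_prime, "industry", user6_public)
print(predicted, scores)
```

With these invented values the classifier favors "health", because both sampled users who share User 6's public attribute-values work in that industry.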
LDA Nearest Neighbor Inference: Chaabane et al. use the Latent Dirichlet Allocation (LDA) generative model to extract semantic links between interest attribute-values [3]. Our variation of their method is as follows.
Each profile $\{id_k, v^k_1, v^k_2, \ldots, v^k_q\}$ in the subpopulation $D'$ consists of the attribute-value pairs for some subset of attributes in $D$. We begin by considering a particular attribute $A_q$. Each attribute has a domain containing a set of values $|A_q| = \{v_1, \ldots, v_m\}$. In LDA terms, we consider each attribute-value a word. For each attribute-value $v_k$, we obtain its related Wikipedia categories to enhance the value sets. We first retrieve the top relevant article describing the attribute-value $v_k$ from a free-text index built by the Lemur Search Engine (footnote 2) over the entire Wikipedia stub contained in the ClueWeb09 collection (footnote 3). The index's size is approximately 1 GB for the compressed documents. Next, we use Wikipedia as an ontology and obtain all the categories and general categories for the top $n$ articles using a toolkit developed by [10]. For instance, the value "someone like you" has the top Wikipedia categories "Adele (singer) songs" and "Singles certified septuple platinum by the Australian Recording Industry Association". These categories help to create the hidden topical structure that will be inferred using the observed attribute-values. Intuitively, the distribution of
2 http://www.lemurproject.org
3 http://lemurproject.org/clueweb09
• Make inferences using the inference engine
• Use user profiles to construct an inference engine containing a set of inference rules
What can be inferred from the population?
[Figure: chart of inference gain; axis from 0 to 15]
LinkedIn dataset: 91,150 public profiles, 12 attributes per profile
[Moore et al. 2013]
Web Footprinting
Experiments for Understanding Public Profiles
About.me: a personal website hosting site. Each user can make a custom webpage about themselves and can list links to their social media profiles on multiple websites.
Using the site's API, we collected information on 124,497 people, which serves as our ground truth.
Creating Web Footprints Using Google+, Foursquare, and LinkedIn Profiles
[Singh et al. 2015]
Synonyms can be found
[Figure labels: DBpedia, synonyms, meronym]
Using an Ontology
Approximately 8,000 attributes were matched up from the ontology
Taking Control of Our Web Identity and Data
1. Keep your public profile professional.
2. Change all your social media account settings that have personal information on them from public to private.
3. Choose your friends wisely – add them selectively.
4. Join groups related to your professional interests.
5. Make it difficult for automated tools to link your accounts, e.g., use different account user names, share different information, etc.
6. Install ad blockers to reduce data about your click-through habits.
7. Set your browser to not accept cookies from sites that you have not visited before.
The world around us
DATAFICATION
Data Ethics
• Regulation
  – We need to hold companies to higher standards
• Data ethics standards
  – We need discussion, debate, and possibly a new discipline
• Catalog of personal data
  – Individuals should be able to see, correct, and/or remove data companies have about them
[Singh 2016]
Final Thoughts
• There is a cultural acceptance of sharing private data publicly.
• This is a problem. I have shown you different techniques for generating web footprints. It is too easy.
• We need new ways to help users understand what data can be determined about them and help them take control of their information.
• We need to pause and debate online privacy and ethical uses of large-scale human behavioral data.
• We need to develop guidelines and regulations that protect users.
We need to take back control of our data
References
J. Zhu, S. Zhang, L. Singh, H. Yang, and M. Sherr. Generating Risk Reduction Recommendations to Decrease Identifiability of Public Online Profiles. Under submission.
A. Hian-Cheong, L. Singh, M. Sherr, and H. Yang. Semantics and Public Information Exposure Detection. Invited.
L. Singh, H. Yang, M. Sherr, A. Hian-Cheong, K. Tian, J. Zhu, and S. Zhang. Public Information Exposure Detection: Helping Users Understand Their Web Footprints. International Conference on Advances in Social Networks Analysis and Mining (ASONAM), Paris, France. IEEE/ACM, 2015.
L. Singh, H. Yang, M. Sherr, Y. Wei, A. Hian-Cheong, K. Tian, J. Zhu, S. Zhang, T. Vaidya, and E. Asgarli. Helping Users Understand Their Web Footprints. International Conference on World Wide Web (WWW), Companion Proceedings, Florence, Italy. Poster paper, 2015.
W. B. Moore, Y. Wei, A. Orshefsky, M. Sherr, L. Singh, and H. Yang. Understanding Site-Based Inference Potential for Identifying Hidden Attributes. International Conference on Privacy, Security, Risk and Trust, Alexandria, VA. IEEE Computer Society, 2013.
J. Ferro, L. Singh, and M. Sherr. Identifying Individual Vulnerability Based on Public Data. International Conference on Privacy, Security and Trust, Tarragona, Catalonia, Spain. IEEE Computer Society, 2013.
F. Nagle, L. Singh, and A. Gkoulalas-Divanis. EWNI: Efficient Anonymization of Vulnerable Individuals in Social Networks. Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), Kuala Lumpur, Malaysia. Springer, 2012.
A. Ramachandran, L. Singh, E. Porter, and F. Nagle. Exploring Re-identification Risks in Public Domains. Conference on Privacy, Security and Trust (PST). IEEE Computer Society, 2012.
The Team & Support
• Faculty – Lisa Singh, Micah Sherr, Grace Hui Yang
• Students & Researchers – Rob Churchill, Kristen Skillman, Kevin Tian, Sicong Zhang, Yanan Zhu
• Alumni – Aditi Ramachandran, Frank Nagle, John Ferro, Yifang Wei, Brad Moore, Andrew Hian-Cheong, Janet Zhu
Support: National Science Foundation
5 Reasons to Join Our Program
1. We are research active and provide full financial RA support for PhD students for 5 years.
2. We have full and partial scholarships for Master's students.
3. Our courses span applied and theoretical areas of computer science, as well as interdisciplinary areas like data science.
4. We have exceptional job placement in top tech firms, national labs, and government agencies.
5. We have a strong community among students and faculty.
Need more information? Website: http://cs.georgetown.edu. Graduate Director: Lisa Singh (singh@cs.georgetown.edu)
Application deadline: April 1
Linking data
Facebook
First Name: Sally; Last Name: Smith; Gender: Female; Location: Georgetown; Hometown: Pittsburgh; Favorite Sports Team: Seahawks; Religion: Atheist
Google+
First Name: Sally; Last Name: Smith; Gender: Female; Location: Georgetown; Occupation: Dentist; Relationship Status: Married; Zip code: 22033
Adversary's Beliefs
First Name: Sally; Last Name: Smith; Gender: Female; Location: Georgetown; Hometown: Pittsburgh; Occupation: Dentist; Favorite Sports Team: Redskins; Religion: Atheist; Relationship Status: Married; Zip Code: 22033
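The linking step behind these beliefs can be sketched as a simple merge. The Python example below links the Facebook and Google+ profiles from the slide when a chosen set of shared attributes agrees, then unions the remaining attribute-values. The function name `link_profiles` and the choice of matching keys are assumptions for illustration; the sketch only combines disclosed values and does not model any further inference an adversary might make.

```python
facebook = {
    "First Name": "Sally", "Last Name": "Smith", "Gender": "Female",
    "Location": "Georgetown", "Hometown": "Pittsburgh",
    "Favorite Sports Team": "Seahawks", "Religion": "Atheist",
}
google_plus = {
    "First Name": "Sally", "Last Name": "Smith", "Gender": "Female",
    "Location": "Georgetown", "Occupation": "Dentist",
    "Relationship Status": "Married", "Zip code": "22033",
}

def link_profiles(a, b, keys):
    """Merge two profiles if they agree on every key; otherwise return None."""
    if any(a.get(k) != b.get(k) for k in keys):
        return None
    merged = dict(a)
    for attr, value in b.items():
        merged.setdefault(attr, value)  # keep a's value when both sites disclose one
    return merged

# Hypothetical choice of linking attributes.
match_keys = ["First Name", "Last Name", "Gender", "Location"]
footprint = link_profiles(facebook, google_plus, match_keys)
print(footprint)
```

Neither site alone reveals both religion and occupation, but the merged footprint contains both.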
What about friends?
Site 1: starting user, with the list of names of the user's friends
Site 2: list of names of friends for a given user
match = number of overlapping friends between users
[Ramachandran et al. 2012]
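The friend-overlap score on this slide can be illustrated with a short Python sketch. It counts the friend names that two accounts have in common and picks the candidate account with the largest overlap. All names below are invented, and the lowercase normalization and the `best_match` helper are assumptions of this example rather than details taken from [Ramachandran et al. 2012].

```python
def friend_overlap(friends_a, friends_b):
    """match = number of overlapping friend names between two users."""
    normalize = lambda names: {name.strip().lower() for name in names}
    return len(normalize(friends_a) & normalize(friends_b))

def best_match(start_friends, candidates):
    """Return the candidate account whose friend list overlaps the most."""
    return max(candidates, key=lambda user: friend_overlap(start_friends, candidates[user]))

# Hypothetical friend lists: one starting user on site 1, two candidates on site 2.
site1_friends = ["Ana Diaz", "Bo Chen", "Cy Patel", "Dee Okoro"]
site2_candidates = {
    "jdoe_1": ["ana diaz", "Bo Chen", "Eli Novak"],
    "jdoe_2": ["Fay Ito", "Dee Okoro"],
}

winner = best_match(site1_friends, site2_candidates)
print(winner, friend_overlap(site1_friends, site2_candidates[winner]))
```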
Web Footprint
[Figure: three "John Doe" profiles, holding attributes A1 A2, A3 A4, and A5 A6, are combined into a single web footprint A1 A2 A3 A4 A5 A6]
Really linking data
Shared Public Attributes
Google+
• Company
• Occupation
• Education
• Location
• Birthdate
• Relationship status
• Gender
• Graduation year

• Company
• Location
• Education
• Email
• Occupation
• Skills
• Industry
• Website
• Languages
FourSquare
• Facebook id
• Twitter handle
• Email
• Gender
• Location
• Phone number
• Relationship status
What do group memberships tell us?
What about tweets?
• A special wish for a special girl HappyBirthday
• I love Starbuck MangoTeaLemonade
• Go Bears
[Singh et al. 2015]
• Birthday
• Gender
• Address
• Education
• Hobbies

• Skills
• Title
• Industry
• Education
• Experience

• Thoughts
• Ideas
• Interests
• Hobbies
To what degree can site-level data be leveraged to determine the undisclosed attributes of a user?
What about the population?
Methodology
• Sample user profiles from media sites
[Figure 1 diagram: public profiles feed Step 1, Subpopulation Sampling; Step 2, Inference Engine Construction, produces inference models and rules such as ab → c, acd → e, ad → b, and bcdf → a; Step 3, Determination of Hidden Attribute-Values, applies the inference engine to a user profile to output hidden attribute-values]
Fig. 1. Attribute inference methodology. The adversary samples a random subpopulation of site-level public profiles (left), constructs site-level inference rules and models using the sampled public profiles (center), and applies the inference engine to a targeted user's public profile to predict a hidden attribute-value (right).
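Step 3 of the figure can be illustrated with the four example rules it shows. The Python sketch below fires every rule whose antecedent is contained in a user's public attribute-values and reports the consequents that are not already public. It is a single-pass illustration: it does not chain rules, rank them by support or confidence, or mine them from data, and the letters simply stand for attribute-values as in the figure.

```python
# Rules as shown in the figure: antecedent set -> consequent.
RULES = [
    (frozenset("ab"), "c"),
    (frozenset("acd"), "e"),
    (frozenset("ad"), "b"),
    (frozenset("bcdf"), "a"),
]

def apply_rules(public_values, rules=RULES):
    """Return consequents of rules whose antecedents are all publicly known."""
    known = set(public_values)
    return {
        consequent
        for antecedent, consequent in rules
        if antecedent <= known and consequent not in known
    }

# Hypothetical public profile disclosing attribute-values a, b, and d.
inferred = apply_rules({"a", "b", "d"})
print(inferred)
```

For this profile only ab → c produces something new: ad → b also fires, but b is already public.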
Linking data
Facebook
• First Name: Sally
• Last Name: Smith
• Gender: Female
• Location: Georgetown
• Hometown: Pittsburgh
• Favorite Sports Team: Seahawks
• Religion: Atheist
Google+
• First Name: Sally
• Last Name: Smith
• Gender: Female
• Location: Georgetown
• Occupation: Dentist
• Relationship Status: Married
• Zip Code: 22033
Adversary's Beliefs
• First Name: Sally
• Last Name: Smith
• Gender: Female
• Location: Georgetown
• Hometown: Pittsburgh
• Occupation: Dentist
• Favorite Sports Team: Redskins
• Religion: Atheist
• Relationship Status: Married
• Zip Code: 22033
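The linking step on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the matching rule (`KEY_FIELDS` must agree exactly) and the function name `link_profiles` are hypothetical, since the slides do not say how the adversary decides that two profiles belong to the same person. The sketch only unions listed attributes; it does not model how an adversary might arrive at a belief that differs from a listed value.

```python
from typing import Dict, Optional

Profile = Dict[str, str]

# Hypothetical matching rule: the slides do not say which fields must agree.
KEY_FIELDS = ("First Name", "Last Name", "Gender", "Location")


def link_profiles(a: Profile, b: Profile) -> Optional[Profile]:
    """Return the union of two profiles if all key fields agree, else None."""
    for field in KEY_FIELDS:
        if field not in a or a[field] != b.get(field):
            return None
    merged = dict(a)
    for field, value in b.items():
        # Keep the first profile's value when both sites list the same field.
        merged.setdefault(field, value)
    return merged


facebook = {
    "First Name": "Sally", "Last Name": "Smith", "Gender": "Female",
    "Location": "Georgetown", "Hometown": "Pittsburgh",
    "Favorite Sports Team": "Seahawks", "Religion": "Atheist",
}
google_plus = {
    "First Name": "Sally", "Last Name": "Smith", "Gender": "Female",
    "Location": "Georgetown", "Occupation": "Dentist",
    "Relationship Status": "Married", "Zip Code": "22033",
}

linked = link_profiles(facebook, google_plus)
```

Neither site alone lists both religion and occupation, but the linked profile does, which is the point of the slide.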
What about friends?
Starting user: list of names of friends
Given user: list of names of friends for the given user
match = number of overlapping friends between users
site 1, site 2
[Ramachandran et al. 2012]
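A minimal Python sketch of the friend-overlap score on this slide follows. The name normalization and the `best_match` helper are assumptions added for illustration; the slide only defines the match as the number of overlapping friends between users on two sites.

```python
from typing import Dict, Iterable, Optional, Tuple


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so small formatting differences do not block a match."""
    return " ".join(name.lower().split())


def match_score(friends_a: Iterable[str], friends_b: Iterable[str]) -> int:
    """match = number of overlapping friend names between two users."""
    return len({normalize(n) for n in friends_a} & {normalize(n) for n in friends_b})


def best_match(starting_friends: Iterable[str],
               candidates: Dict[str, Iterable[str]]) -> Tuple[Optional[str], int]:
    """Return the candidate account on the second site with the largest overlap."""
    starting = list(starting_friends)
    best_id, best = None, 0
    for candidate_id, friends in candidates.items():
        score = match_score(starting, friends)
        if score > best:
            best_id, best = candidate_id, score
    return best_id, best


# Hypothetical data: one starting user on site 1, two candidate accounts on site 2.
site1_friends = ["Ann Lee", "Bob Ray", "Cy Doe"]
site2_candidates = {
    "jsmith_1": ["ann lee", "Dee Fox"],
    "jsmith_2": ["Bob  Ray", "Ann Lee", "Cy Doe"],
}
winner = best_match(site1_friends, site2_candidates)
```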
Web Footprint
[Diagram: three John Doe profiles, holding attributes A1 A2, A3 A4, and A5 A6, combine into one web footprint A1 A2 A3 A4 A5 A6]
Really linking data
Shared Public Attributes
Google+
• Company
• Occupation
• Education
• Location
• Birthdate
• Relationship status
• Gender
• Graduation Year
LinkedIn
• Company
• Location
• Education
• Email
• Occupation
• Skills
• Industry
• Website
• Languages
FourSquare
• Facebook id
• Twitter handle
• Email
• Gender
• Location
• Phone number
• Relationship status
What do group memberships tell us?
What about tweets?
• A special wish for a special girl HappyBirthday
• I love Starbuck MangoTeaLemonade
• Go Bears
[Singh et al. 2015]
• Birthday
• Gender
• Address
• Education
• Hobbies

• Skills
• Title
• Industry
• Education
• Experience

• Thoughts
• Ideas
• Interests
• Hobbies
To what degree can site-level data be leveraged to determine the undisclosed attributes of a user?
What about the population?
Methodology
• Sample user profiles from media sites
[Figure 1 panels: Public Profiles; Step 1: Subpopulation Sampling; Step 2: Inference Engine Construction (Inference Model, with example rules ab → c, acd → e, ad → b, bcdf → a); Step 3: Determination of Hidden Attribute-Values (User Profile, Inference Engine, Hidden Attribute-Values)]
Fig. 1. Attribute inference methodology. The adversary samples a random subpopulation of site-level public profiles (left), constructs site-level inference rules and models using the sampled public profiles (center), and applies the inference engine to a targeted user's public profile to predict a hidden attribute-value (right).
[...] the value of attribute $A_j$ when $\mathit{restricted}(i, A_j) = $ [...]). The adversary's goal is to discover information about $i$ which is not in the public profile $P^i$.
To more formally define the attacker, we introduce a truth function $\mathit{truth}(i)$ that returns a set $\{id, v_{i1}, v_{i2}, \ldots, v_{im}\}$ such that (1) $id = P^i_{ID}$ and (2) either $v_{ij} = P^i_j$ when the public profile holds a value for $A_j$, or otherwise $v_{ij}$ is the correct value of attribute $A_j$ for user $i$. Intuitively, $\mathit{truth}(i)$ is the complete set of correct values for user $i$ for attributes $ID, A_1, \ldots, A_m$ and can contain values that are not in $D$. Hence, the adversary's goal is to infer the set $\mathit{truth}(i) \setminus P^i$. Note that this includes both attributes that are restricted using the site's permission system as well as attributes that are unknown (i.e., null) to the site. Toward that end, in this paper, we attempt to infer single attribute-values using data available to the adversary.
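As a small illustration of the adversary's goal stated above (the correct values for a user that are not in the public profile), the following Python sketch treats a profile as a set of attribute-value pairs. The attribute values, the use of `None` for a restricted or unknown attribute, and the function name `adversary_target` are assumptions made for the example, not notation from the paper.

```python
from typing import Dict, Optional, Set, Tuple


def adversary_target(truth: Dict[str, str],
                     public: Dict[str, Optional[str]]) -> Set[Tuple[str, str]]:
    """Pairs that are in the full set of correct values but not visible publicly."""
    # Assumption: None marks an attribute that is restricted or unknown to the site.
    visible = {(attr, value) for attr, value in public.items() if value is not None}
    return set(truth.items()) - visible


# Hypothetical user: relationship is restricted, industry was never given to the site.
truth_u6 = {"id": "u6", "gender": "F", "relationship": "Married", "industry": "Health Care"}
public_u6 = {"id": "u6", "gender": "F", "relationship": None}

hidden = adversary_target(truth_u6, public_u6)
```

Both the restricted attribute and the attribute unknown to the site end up in the adversary's target set, matching the note that the goal covers both cases.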
IV. ATTRIBUTE INFERENCE METHODOLOGY
Even though social networking sites often include privacy settings that allow a user to control which attributes in her profile are disclosed to the public, based on the previous literature presented in Section II, we make the observation that removing sensitive attributes from a public profile is insufficient to ensure that those attributes are not easily discoverable. In this paper, we are interested in understanding how a public attribute or public attribute combination can be used to infer hidden values. Therefore, we analyze how effective frequent patterns of a site's subpopulation are for inferring sensitive attributes that are hidden by a particular user.
We develop an attribute inference methodology for determining non-published attributes about a targeted user. Our methodology is based on the premise that an attacker may explore the site in question and then use this background knowledge to make inferences about a particular user's non-published attributes. Figure 1 shows the three steps in our high-level attribute inference methodology: subpopulation sampling, inference engine construction, and determination of hidden attribute-values.
A. Subpopulation Sampling
The first step of our methodology is to randomly sample a subpopulation of profiles from a site containing a database $D$. More formally, our subpopulation $D'$ has a representative sample of the attribute-value pairs of interest from $D$ (i.e., $D' \subset D$). In practice, an adversary can trivially obtain a subpopulation sample by using a site's web interface or API.
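The sampling step can be sketched as a uniform random draw without replacement. This is a simplified stand-in: it assumes the public profiles are already available as an in-memory list, whereas the text describes collecting them through a site's web interface or API. The function name, the `seed` parameter, and the generated profiles are hypothetical.

```python
import random
from typing import Dict, List, Optional


def sample_subpopulation(profiles: List[Dict[str, str]], k: int,
                         seed: Optional[int] = None) -> List[Dict[str, str]]:
    """Draw up to k profiles uniformly at random, without replacement."""
    rng = random.Random(seed)  # a local generator keeps the draw reproducible
    return rng.sample(profiles, min(k, len(profiles)))


# Hypothetical site database of 100 public profiles.
database = [{"id": str(n), "gender": "F" if n % 2 else "M"} for n in range(100)]
subpopulation = sample_subpopulation(database, 10, seed=7)
```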
B. Site-based Inference Engine Construction and Determination of Hidden Attribute-Values
There are many methods for building a site-based inference engine. We begin by extending two previously proposed approaches: one that uses Latent Dirichlet Allocation (LDA) [4] and another based on a modified Naïve Bayes method [14]. We then propose a new approach based on multi-attribute association rule mining. Finally, we consider an ensemble approach that incorporates all of the different techniques into the site-level inference engine. Construction of the inference engine is done offline and infrequently for a particular site; therefore, the cost of generating inference models or rules is not significantly burdensome to the adversary.
To clarify the different methods, we will use a toy example based on user data presented in Figure 2. In this example, $D$ contains four attributes: id, gender, relationship, and industry. The adversary is interested in determining User 6's industry attribute-value. In this scenario, User 6 has decided to not make this attribute-value public. Using the site API, the adversary obtains $D'$, a subset of $D$ containing the public profiles for Users 1-5. The adversary will now generate his inference engine using these public profiles. The remainder of this subsection describes each of the methods that can serve as the basis for the inference engine that the adversary will build.
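To make the toy scenario concrete, here is a minimal Python sketch of rule-based inference over Users 1-5. It is a simplified stand-in for association rule mining, not the paper's algorithm: it builds one rule per observed (gender, relationship) combination and scores it by confidence. Figure 2 is not reproduced here, so all attribute values below are hypothetical, as are the function names.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Optional, Tuple

Rule = Tuple[str, float]  # (predicted value, confidence)

# Hypothetical public profiles for Users 1-5 (Figure 2 is not shown in this deck).
sample = [
    {"id": "1", "gender": "F", "relationship": "Married", "industry": "Health Care"},
    {"id": "2", "gender": "F", "relationship": "Married", "industry": "Health Care"},
    {"id": "3", "gender": "M", "relationship": "Single", "industry": "Software"},
    {"id": "4", "gender": "F", "relationship": "Married", "industry": "Education"},
    {"id": "5", "gender": "M", "relationship": "Single", "industry": "Software"},
]


def mine_rules(profiles: List[Dict[str, str]], antecedents: Tuple[str, ...],
               target: str) -> Dict[Tuple[Optional[str], ...], Rule]:
    """For each antecedent combination, keep the most frequent target value and its confidence."""
    counts: Dict[Tuple[Optional[str], ...], Counter] = defaultdict(Counter)
    for profile in profiles:
        if target in profile:
            key = tuple(profile.get(a) for a in antecedents)
            counts[key][profile[target]] += 1
    rules = {}
    for key, counter in counts.items():
        value, support = counter.most_common(1)[0]
        rules[key] = (value, support / sum(counter.values()))
    return rules


def infer(rules: Dict[Tuple[Optional[str], ...], Rule], profile: Dict[str, str],
          antecedents: Tuple[str, ...]) -> Optional[Rule]:
    """Apply the matching rule to a target profile, or return None if no rule matches."""
    return rules.get(tuple(profile.get(a) for a in antecedents))


antecedents = ("gender", "relationship")
rules = mine_rules(sample, antecedents, "industry")
user6 = {"id": "6", "gender": "F", "relationship": "Married"}  # industry is hidden
prediction = infer(rules, user6, antecedents)
```

With this made-up sample, two of the three married women work in health care, so the rule for User 6 predicts that industry with confidence 2/3.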
LDA Nearest Neighbor Inference: Chaabane et al. use the Latent Dirichlet Allocation (LDA) generative model to extract semantic links between interest attribute-values [3]. Our variation of their method is as follows.
Each profile $\{id_k, v_{k1}, v_{k2}, \ldots, v_{kq}\}$ in the subpopulation $D'$ consists of the attribute-value pairs for some subset of attributes in $D$. We begin by considering a particular attribute $A_q$. Each attribute has a domain containing a set of values $|A_q| = \{v_1, \ldots, v_m\}$. In LDA terms, we consider each attribute-value a word. For each attribute-value $v_k$, we obtain its related Wikipedia categories to enhance the value sets. We first retrieve the top relevant article describing the attribute $v_k$ from a free text index built by the Lemur Search Engine² over the entire Wikipedia stub contained in the ClueWeb09 collection³. The index's size is approximately 1GB for the compressed documents. Next, from each of these articles, we use Wikipedia as an ontology and obtain all the categories and general categories for the top n articles using a toolkit developed by [10]. For instance, a value "someone like you" has top Wikipedia categories "Adele (singer) songs" and "Singles certified septuple platinum by the Australian Recording Industry Association". These categories help to create the hidden topical structure that will be inferred using the observed attribute-values. Intuitively, the distribution of
² http://www.lemurproject.org
³ http://lemurproject.org/clueweb09
• Use user profiles to construct an inference engine containing a set of inference rules
• Make inferences using the inference engine
What can be inferred from the population?
[Chart: inference gain, axis from 0 to 15]
LinkedIn dataset: 91,150 public profiles, 12 attributes per profile
[Moore et al. 2013]
Web Footprinting
Experiments for Understanding Public Profiles
About.me: a personal website hosting site. Each user can make a custom webpage about themselves and can list links to their social media profiles on multiple websites.
Using their API, we collected 124,497 people's information -> Ground Truth
Creating Web Footprints Using Google+, Foursquare, and LinkedIn Profiles
[Singh et al. 2015]
Synonyms can be found
DBpedia
Synonyms
Meronym
Using an Ontology
Approximately 8,000 attributes were matched up from the ontology
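One way to picture ontology-based matching is a lookup that maps each site's attribute name to a canonical term before profiles are compared. The Python sketch below is only an illustration: the synonym table is hand-written and hypothetical, not drawn from DBpedia, and the slides do not describe how the actual matching was implemented.

```python
from typing import Dict, Set

# Hypothetical synonym table; a real system would derive this from an ontology.
SYNONYMS: Dict[str, str] = {
    "occupation": "occupation",
    "job title": "occupation",
    "title": "occupation",
    "company": "company",
    "employer": "company",
    "location": "location",
    "city": "location",
}


def canonical(attribute: str) -> str:
    """Map an attribute name to its canonical term, or leave it unchanged if unknown."""
    key = " ".join(attribute.lower().split())
    return SYNONYMS.get(key, key)


def align(profile: Dict[str, str]) -> Dict[str, str]:
    """Rename a profile's attributes to canonical terms."""
    return {canonical(name): value for name, value in profile.items()}


def shared_attributes(a: Dict[str, str], b: Dict[str, str]) -> Set[str]:
    """Canonical attribute names that appear in both profiles."""
    return set(align(a)) & set(align(b))


site_a = {"Job Title": "Dentist", "City": "Washington DC"}
site_b = {"Occupation": "Dentist", "Location": "Washington DC", "Zip Code": "22033"}
common = shared_attributes(site_a, site_b)
```

Without the synonym step the two example profiles would share no attribute names at all; with it, two attributes line up for comparison.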
Taking Control of Our Web Identity and Data
1. Keep your public profile professional
2. Change all your social media account settings that have personal information on them from public to private
3. Choose your friends wisely: add them selectively
4. Join groups related to your professional interests
5. Make it difficult for automated tools to link your accounts, e.g., use different account user names, share different information, etc.
6. Install ad blockers to reduce data about your click-through habits
7. Set your browser to not accept cookies from sites that you have not visited before
The world around us
DATAFICATION
Data Ethics
• Regulation
  - We need to hold companies to higher standards
• Data ethics standards
  - We need discussion, debate, and possibly a new discipline
• Catalog of personal data
  - Individuals should be able to see, correct, and/or remove data companies have about them
[Singh 2016]
Final Thoughts
• There is a cultural acceptance of sharing private data publicly
• This is a problem. I have shown you different techniques for generating web footprints. It is too easy.
• We need new ways to help users understand what data can be determined about them and help them take control of their information
• We need to pause and debate online privacy and ethical uses of large-scale human behavioral data
• We need to develop guidelines and regulations that protect users
We need to take back control of our data
References
J. Zhu, S. Zhang, L. Singh, H. Yang, and M. Sherr. Generating Risk Reduction Recommendations to Decrease Identifiability of Public Online Profiles. Under submission.
A. Hian-Cheong, L. Singh, M. Sherr, H. Yang. Semantics and Public Information Exposure Detection. Invited.
L. Singh, H. Yang, M. Sherr, A. Hian-Cheong, K. Tian, J. Zhu, and S. Zhang. Public Information Exposure Detection: Helping Users Understand Their Web Footprints. International Conference on Advances in Social Networks Analysis and Mining (ASONAM), Paris, France, IEEE/ACM, 2015.
L. Singh, H. Yang, M. Sherr, Y. Wei, A. Hian-Cheong, K. Tian, J. Zhu, S. Zhang, T. Vaidya, and E. Asgarli. Helping Users Understand Their Web Footprints. International Conference on World Wide Web - Companion Proceedings, World Wide Web (WWW), Florence, Italy, Poster Paper, 2015.
W. B. Moore, Y. Wei, A. Orshefsky, M. Sherr, L. Singh, H. Yang. Understanding Site-Based Inference Potential for Identifying Hidden Attributes. International Conference on Privacy, Security, Risk and Trust, Alexandria, VA, IEEE Computer Society, 2013.
J. Ferro, L. Singh, M. Sherr. Identifying individual vulnerability based on public data. International Conference on Privacy, Security and Trust, Tarragona, Catalonia, Spain, IEEE Computer Society, 2013.
F. Nagle, L. Singh, and A. Gkoulalas-Divanis. EWNI: Efficient Anonymization of Vulnerable Individuals in Social Networks. Pacific Asian Conference on Knowledge Discovery and Data Mining (PAKDD), Kuala Lumpur, Malaysia, Springer, 2012.
A. Ramachandran, L. Singh, E. Porter, and F. Nagle. Exploring re-identification risks in public domains. Conference on Privacy, Security and Trust (PST), IEEE Computer Society, 2012.
The Team & Support
• Faculty: Lisa Singh, Micah Sherr, Grace Hui Yang
• Students & Researchers: Rob Churchill, Kristen Skillman, Kevin Tian, Sicong Zhang, Yanan Zhu
• Alumni: Aditi Ramachandran, Frank Nagle, John Ferro, Yifang Wei, Brad Moore, Andrew Hian-Cheong, Janet Zhu
Support: National Science Foundation
5 Reasons to Join Our Program
1. We are research active and provide full financial RA support for PhD students for 5 years
2. We have full and partial scholarships for Master's students
3. Our courses span applied and theoretical areas of computer science, as well as interdisciplinary areas like data science
4. We have exceptional job placement in top tech firms, national labs, and government agencies
5. We have a strong community among students and faculty
Need more information?
Website: http://cs.georgetown.edu
Graduate Director: Lisa Singh (singh@cs.georgetown.edu)
Application deadline: April 1
Linking dataFacebook
First Name SallyLast Name SmithGender FemaleLocation GeorgetownHometown PittsburghFavorite Sports Team SeahawksReligion Atheist
Google+
First Name SallyLast Name SmithGender FemaleLocation Georgetown Occupation DentistRelationship Status MarriedZip code 22033
Adversaryrsquos Beliefs
First Name SallyLast Name SmithGender FemaleLocation GeorgetownHometown PittsburghOccupation DentistFavorite Sports Team RedskinsReligion AtheistRelationship Status MarriedZip Code 22033
What about friends
Startinguser
Listofnamesoffriends
Listofnamesoffriendsforgiven
user
match =numberoverlappingfriendsbetweenusers
site1 site2
[Ramachandran etal2012]
WebFootprint
A1A2A3A4A5A6
A1A2
JohnDoe
A3A4
JohnDoe
A5A6
JohnDoe
Really linking data
Shared Public Attributes
Google+
bull Companybull Occupationbull Educationbull Locationbull Birthdatebull Relationshipstatus
bull Genderbull GraduationYear
bull Companybull Locationbull Educationbull Emailbull Occupationbull Skillsbull Industrybull Websitebull Languages
FourSquare
bull Facebookidbull Twitterhandle
bull Emailbull Genderbull Locationbull Phonenumber
bull Relationshipstatus
What do group memberships tell us
What about tweets
bull A special wish for a special girl HappyBirthdaybull I love Starbuck MangoTeaLemonadebull Go Bears
[Singhetal2015]
bull Birthdaybull Genderbull Addressbull Educationbull Hobbies
bull Skillsbull Titlebull Industrybull Educationbull Experience
bull Thoughtsbull Ideasbull Interestsbull HobbiesTowhatdegreecansiteleveldatabe
leveragedtodeterminetheundisclosedattributesofauser
What about the population
Methodology
bull Sample user profiles from media sites
ab rarr cacd rarr ead rarr b
bcdf rarr a
Public Profiles
Step 1Subpopulation
Sampling
Step 2Inference Engine
Construction
UserProfile
InferenceModel
HiddenAttribute-Values
Step 3Determination of Hidden
Attribute-Values
InferenceEngine
InferenceModel
Fig 1 Attribute inference methodology The adversary samples arandom subpopulation of site-level public profiles (left) constructssite-level inference rules and models using the sampled public profiles(center) and applies the inference engine to a targeted userrsquos publicprofile to predict a hidden attribute-value (right)
the value of attribute Aj when restricted(i Aj) = ) Theadversaryrsquos goal is to discover information about i which isnot in the public profile P i
To more formally define the attacker we introduce a truthfunction truth(i) that returns a set id vi1 vi2 vim suchthat (1) id = Pi
ID and (2) either vij = Pij if Pi
j = orotherwise vij is the correct value of attribute Aj for user iIntuitively truth(i) is the complete set of correct values foruser i for attributes ID A1 Am and can contain valuesthat are not in D Hence the adversaryrsquos goal is to infer theset truth(i) P i Note that this includes both attributes thatare restricted using the sitersquos permission system as well asattributes that are unknown (ie null) to the site Toward thatend in this paper we attempt to infer single attribute-valuesusing data available to the adversary
IV ATTRIBUTE INFERENCE METHODOLOGY
Even though social networking sites often include privacysettings that allow a user to control which attributes in her pro-file are disclosed to the public based on the previous literaturepresented in Section II we make the observation that removingsensitive attributes from a public profile is insufficient to ensurethat those attributes are not easily discoverable In this paperwe are interested in understanding how a public attribute orpublic attribute combination can be used to infer hidden valuesTherefore we analyze how effective frequent patterns of asitersquos subpopulation are for inferring sensitive attributes thatare hidden by a particular user
We develop an attribute inference methodology for deter-mining non-published attributes about a targeted user Ourmethodology is based on the premise that an attacker mayexplore the site in question and then use this backgroundknowledge to make inferences about a particular userrsquos non-published attributes Figure 1 shows the three steps in our highlevel attribute inference methodology subpopulation samplinginference engine construction and determination of hiddenattribute-values
A Subpopulation Sampling
The first step of our methodology is to randomly samplea subpopulation of profiles from a site containing a databaseD More formally our subpopulation D0 has a representativesample of the attribute-value pairs of interest from D (ieD0 D) In practice an adversary can trivially obtain asubpopulation sample by using a sitersquos web interface or API
B Site-based Inference Engine Construction and Determina-tion of Hidden Attribute-Values
There are many methods for building a site-based inferenceengine We begin by extending two previously proposed ap-proaches one that uses Latent Dirichlet Allocation (LDA) [4]and another based on a modified Naıve Bayes method [14]We then propose a new approach based on multi-attributeassociation rule mining Finally we consider an ensembleapproach that incorporates all of the different techniques intothe site-level inference engine Construction of the inferenceengine is done offline and infrequently for a particular sitetherefore the cost of generating inference models or rules isnot significantly burdensome to the adversary
To clarify the different methods we will use a toy examplebased on user data presented in Figure 2 In this example Dcontains four attributes id gender relationship and industryThe adversary is interested in determining User 6rsquos industryattribute-value In this scenario User 6 has decided to not makethis attribute-value public Using the site API the adversaryobtains D0 a subset of D containing the public profiles forUsers 1-5 The adversary will now generate his inferenceengine using these public profiles The remainder of thissubsection describes each of the methods that can serve as thebasis for the inference engine that the adversary will build
LDA Nearest Neighbor Inference Chaabane et al usethe Latent Dirichlet Allocation (LDA) generative model toextract semantic links between interest attribute-values [3] Ourvariation of their method is as follows
Each profile idk vk1 vk2 vkq in the subpopulation D0
consists of the attribute-value pairs for some subset of at-tributes in D We begin by considering a particular attributeAq Each attribute has a domain containing a set of values|Aq| = v1 vm In LDA terms we consider eachattribute-value a word For each attribute-value vk we obtainits related Wikipedia categories to enhance the value sets Wefirst retrieve the top relevant article describing the attributevk from a free text index built by the Lemur Search Engine2
over the entire Wikipedia stub contained in the ClueWeb09collection3 The indexrsquos size is approximately 1GB for thecompressed documents Next from each of these articleswe use Wikipedia as an ontology and obtain all the cate-gories and general categories for the top n articles using atoolkit developed by [10] For instance a value ldquosomeonelike yourdquo has top Wikipedia categories ldquoAdele (singer) songsrdquoand ldquoSingles certified septuple platinum by the AustralianRecording Industry Associationrdquo These categories help tocreate the hidden topical structure that will be inferred usingthe observed attribute-values Intuitively the distribution of
2httpwwwlemurprojectorg3httplemurprojectorgclueweb09
• Use user profiles to construct an inference engine containing a set of inference rules
• Make inferences using the inference engine
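The methodology bullets can be illustrated with a small sketch. This is not the implementation from [Moore et al. 2013]; it is a simplified stand-in for rule-based inference, and the function names (`build_rules`, `infer`), the thresholds, and the toy profiles are all hypothetical values chosen for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_rules(profiles, target, min_support=2, min_confidence=0.6):
    """Mine rules {attribute=value, ...} -> target=value from sampled public profiles.

    Every subset of a profile's public attribute-values is tried as an
    antecedent, which is only practical for profiles with few attributes.
    """
    antecedent_counts = Counter()
    joint_counts = defaultdict(Counter)
    for profile in profiles:
        if target not in profile:
            continue
        items = sorted((a, v) for a, v in profile.items() if a != target)
        for size in range(1, len(items) + 1):
            for antecedent in combinations(items, size):
                antecedent_counts[antecedent] += 1
                joint_counts[antecedent][profile[target]] += 1

    rules = []
    for antecedent, count in antecedent_counts.items():
        if count < min_support:
            continue
        value, hits = joint_counts[antecedent].most_common(1)[0]
        confidence = hits / count
        if confidence >= min_confidence:
            rules.append((antecedent, value, confidence, count))
    return rules


def infer(rules, public_profile):
    """Return (predicted value, confidence) from the best matching rule, or None."""
    matching = [
        rule for rule in rules
        if all(public_profile.get(a) == v for a, v in rule[0])
    ]
    if not matching:
        return None
    # Prefer higher confidence, then higher support, then more specific rules.
    best = max(matching, key=lambda r: (r[2], r[3], len(r[0])))
    return best[1], best[2]


# Hypothetical sampled public profiles (step 1: subpopulation sampling).
sample = [
    {"gender": "F", "relationship": "married", "industry": "education"},
    {"gender": "F", "relationship": "married", "industry": "education"},
    {"gender": "M", "relationship": "single", "industry": "software"},
    {"gender": "M", "relationship": "single", "industry": "software"},
    {"gender": "F", "relationship": "single", "industry": "software"},
]

# Step 2: build the inference engine. Step 3: apply it to a user who hides "industry".
rules = build_rules(sample, target="industry")
prediction = infer(rules, {"gender": "F", "relationship": "married"})
```

Because rule construction happens once per site, the adversary can reuse `rules` against any number of targeted public profiles.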
What can be inferred from the population?
[Chart: inference gain, axis from 0 to 15]
[Moore et al. 2013]
LinkedIn dataset: 91,150 public profiles, 12 attributes per profile
Web Footprinting
Experiments for Understanding Public Profiles
About.me: a personal website hosting site. Each user can make a custom webpage about themselves and can list links to their social media profiles on multiple websites.
Using their API, we collected 124,497 people's information -> Ground Truth
Creating Web Footprints Using Google+, Foursquare, and LinkedIn Profiles
[Singh et al. 2015]
Synonyms can be found
[Figure labels: DBpedia, Synonyms, Meronym]
Using an Ontology
Approximately 8000 attributes were matched up from the ontology
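A minimal sketch of how synonym matching can align attribute names across sites. It assumes a small hand-written synonym table; it does not query DBpedia, and the table entries and function names are hypothetical.

```python
# Hypothetical synonym table: site-specific attribute name -> canonical name.
SYNONYMS = {
    "employer": "company",
    "workplace": "company",
    "job title": "occupation",
    "sex": "gender",
}


def canonical(attribute):
    """Map an attribute name to its canonical form (case-insensitive)."""
    key = " ".join(attribute.lower().split())
    return SYNONYMS.get(key, key)


def align(profile):
    """Rewrite a profile so its attribute names use the canonical vocabulary."""
    return {canonical(name): value for name, value in profile.items()}


aligned = align({"Employer": "Acme", "Sex": "F", "Location": "Georgetown"})
```

Once two profiles share a vocabulary, their attribute-values can be compared directly.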
Taking Control of Our Web Identity and Data
1. Keep your public profile professional.
2. Change all your social media account settings that have personal information on them from public to private.
3. Choose your friends wisely – add them selectively.
4. Join groups related to your professional interests.
5. Make it difficult for automated tools to link your accounts, e.g., use different account user names, share different information, etc.
6. Install ad blockers to reduce data about your click-through habits.
7. Set your browser to not accept cookies from sites that you have not visited before.
The world around us
DATAFICATION
Data Ethics
• Regulation
  – We need to hold companies to higher standards
• Data ethics standards
  – We need discussion, debate, and possibly a new discipline
• Catalog of personal data
  – Individuals should be able to see, correct, and/or remove data companies have about them
[Singh 2016]
Final Thoughts
• There is a cultural acceptance of sharing private data publicly.
• This is a problem. I have shown you different techniques for generating web footprints. It is too easy.
• We need new ways to help users understand what data can be determined about them and help them take control of their information.
• We need to pause and debate online privacy and ethical uses of large-scale human behavioral data.
• We need to develop guidelines and regulations that protect users.
We need to take back control of our data
References
J. Zhu, S. Zhang, L. Singh, H. Yang, and M. Sherr. Generating Risk Reduction Recommendations to Decrease Identifiability of Public Online Profiles. Under submission.
A. Hian-Cheong, L. Singh, M. Sherr, and H. Yang. Semantics and Public Information Exposure Detection. Invited.
L. Singh, H. Yang, M. Sherr, A. Hian-Cheong, K. Tian, J. Zhu, and S. Zhang. Public Information Exposure Detection: Helping Users Understand Their Web Footprints. International Conference on Advances in Social Networks Analysis and Mining (ASONAM), Paris, France. IEEE/ACM, 2015.
L. Singh, H. Yang, M. Sherr, Y. Wei, A. Hian-Cheong, K. Tian, J. Zhu, S. Zhang, T. Vaidya, and E. Asgarli. Helping Users Understand Their Web Footprints. International Conference on World Wide Web (WWW), Companion Proceedings, Florence, Italy. Poster paper, 2015.
W. B. Moore, Y. Wei, A. Orshefsky, M. Sherr, L. Singh, and H. Yang. Understanding Site-Based Inference Potential for Identifying Hidden Attributes. International Conference on Privacy, Security, Risk and Trust, Alexandria, VA. IEEE Computer Society, 2013.
J. Ferro, L. Singh, and M. Sherr. Identifying Individual Vulnerability Based on Public Data. International Conference on Privacy, Security and Trust, Tarragona, Catalonia, Spain. IEEE Computer Society, 2013.
F. Nagle, L. Singh, and A. Gkoulalas-Divanis. EWNI: Efficient Anonymization of Vulnerable Individuals in Social Networks. Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), Kuala Lumpur, Malaysia. Springer, 2012.
A. Ramachandran, L. Singh, E. Porter, and F. Nagle. Exploring Re-identification Risks in Public Domains. Conference on Privacy, Security and Trust (PST). IEEE Computer Society, 2012.
The Team & Support
• Faculty – Lisa Singh, Micah Sherr, Grace Hui Yang
• Students & Researchers – Rob Churchill, Kristen Skillman, Kevin Tian, Sicong Zhang, Yanan Zhu
• Alumni – Aditi Ramachandran, Frank Nagle, John Ferro, Yifang Wei, Brad Moore, Andrew Hian-Cheong, Janet Zhu
Support: National Science Foundation
5 Reasons to Join Our Program
1. We are research active and provide full financial RA support for PhD students for 5 years.
2. We have full and partial scholarships for Master's students.
3. Our courses span applied and theoretical areas of computer science as well as interdisciplinary areas like data science.
4. We have exceptional job placement in top tech firms, national labs, and government agencies.
5. We have a strong community among students and faculty.
Need more information? Website: http://cs.georgetown.edu. Graduate Director: Lisa Singh (singh@cs.georgetown.edu)
Application deadline: April 1
What about friends?
Starting user (site1): list of names of friends
Given user (site2): list of names of friends for given user
match = number of overlapping friends between users
[Ramachandran et al. 2012]
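The match score on this slide can be sketched as a set intersection over friend names. This is an illustrative sketch, not the method of [Ramachandran et al. 2012]; the name normalization, the function names, and the example friend lists are assumptions.

```python
def normalize(name):
    """Lowercase a name and collapse whitespace so trivial variants compare equal."""
    return " ".join(name.lower().split())


def friend_overlap(friends_a, friends_b):
    """match = number of overlapping friend names between two users."""
    return len({normalize(n) for n in friends_a} & {normalize(n) for n in friends_b})


def best_match(start_friends, candidates):
    """Score every candidate on the second site and return (best id, all scores)."""
    if not candidates:
        return None, {}
    scores = {
        candidate_id: friend_overlap(start_friends, friend_list)
        for candidate_id, friend_list in candidates.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical friend lists: one starting user on site1, two candidates on site2.
site1_friends = ["Ann Lee", "Bob Ray", "Cy Dole"]
site2_candidates = {
    "candidate_1": ["ann lee", "Bob  Ray", "Dee Fox"],
    "candidate_2": ["Eve Hart", "Cy Dole"],
}
best, scores = best_match(site1_friends, site2_candidates)
```

Exact name matching is the simplest choice; common names and nicknames would need a more forgiving comparison.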
Web Footprint: A1 A2 A3 A4 A5 A6
John Doe: A1 A2
John Doe: A3 A4
John Doe: A5 A6
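A minimal sketch of how a web footprint accumulates once accounts on different sites are linked to the same person. The site names, attribute names (A1 to A6), and values are placeholders, and `build_footprint` is a hypothetical helper.

```python
def build_footprint(profiles_by_site):
    """Merge linked profiles into attribute -> {value: [sites that expose it]}."""
    footprint = {}
    for site, profile in profiles_by_site.items():
        for attribute, value in profile.items():
            footprint.setdefault(attribute, {}).setdefault(value, []).append(site)
    return footprint


# Three linked profiles for the same person, each exposing two attributes.
linked_profiles = {
    "site1": {"A1": "x1", "A2": "x2"},
    "site2": {"A3": "x3", "A4": "x4"},
    "site3": {"A5": "x5", "A6": "x6"},
}
footprint = build_footprint(linked_profiles)
```

No single site exposes more than two attributes here, but the linked footprint exposes all six.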
Really linking data
Shared Public Attributes

Google+
• Company
• Occupation
• Education
• Location
• Birthdate
• Relationship status
• Gender
• Graduation year

• Company
• Location
• Education
• Email
• Occupation
• Skills
• Industry
• Website
• Languages

FourSquare
• Facebook id
• Twitter handle
• Email
• Gender
• Location
• Phone number
• Relationship status
What do group memberships tell us?
What about tweets?
• A special wish for a special girl HappyBirthday
• I love Starbuck MangoTeaLemonade
• Go Bears
[Singh et al. 2015]
• Birthday • Gender • Address • Education • Hobbies
• Skills • Title • Industry • Education • Experience
• Thoughts • Ideas • Interests • Hobbies
What about the population?
To what degree can site-level data be leveraged to determine the undisclosed attributes of a user?
Methodology
• Sample user profiles from media sites
[Figure: Public Profiles → Step 1: Subpopulation Sampling → Step 2: Inference Engine Construction → Inference Model (example rules: ab → c, acd → e, ad → b, bcdf → a); User Profile + Inference Engine → Step 3: Determination of Hidden Attribute-Values → Hidden Attribute-Values]
Fig. 1. Attribute inference methodology. The adversary samples a random subpopulation of site-level public profiles (left), constructs site-level inference rules and models using the sampled public profiles (center), and applies the inference engine to a targeted user's public profile to predict a hidden attribute-value (right).
the value of attribute Aj when restricted(i Aj) = ) Theadversaryrsquos goal is to discover information about i which isnot in the public profile P i
To more formally define the attacker we introduce a truthfunction truth(i) that returns a set id vi1 vi2 vim suchthat (1) id = Pi
ID and (2) either vij = Pij if Pi
j = orotherwise vij is the correct value of attribute Aj for user iIntuitively truth(i) is the complete set of correct values foruser i for attributes ID A1 Am and can contain valuesthat are not in D Hence the adversaryrsquos goal is to infer theset truth(i) P i Note that this includes both attributes thatare restricted using the sitersquos permission system as well asattributes that are unknown (ie null) to the site Toward thatend in this paper we attempt to infer single attribute-valuesusing data available to the adversary
IV ATTRIBUTE INFERENCE METHODOLOGY
Even though social networking sites often include privacysettings that allow a user to control which attributes in her pro-file are disclosed to the public based on the previous literaturepresented in Section II we make the observation that removingsensitive attributes from a public profile is insufficient to ensurethat those attributes are not easily discoverable In this paperwe are interested in understanding how a public attribute orpublic attribute combination can be used to infer hidden valuesTherefore we analyze how effective frequent patterns of asitersquos subpopulation are for inferring sensitive attributes thatare hidden by a particular user
We develop an attribute inference methodology for deter-mining non-published attributes about a targeted user Ourmethodology is based on the premise that an attacker mayexplore the site in question and then use this backgroundknowledge to make inferences about a particular userrsquos non-published attributes Figure 1 shows the three steps in our highlevel attribute inference methodology subpopulation samplinginference engine construction and determination of hiddenattribute-values
A Subpopulation Sampling
The first step of our methodology is to randomly samplea subpopulation of profiles from a site containing a databaseD More formally our subpopulation D0 has a representativesample of the attribute-value pairs of interest from D (ieD0 D) In practice an adversary can trivially obtain asubpopulation sample by using a sitersquos web interface or API
B Site-based Inference Engine Construction and Determina-tion of Hidden Attribute-Values
There are many methods for building a site-based inferenceengine We begin by extending two previously proposed ap-proaches one that uses Latent Dirichlet Allocation (LDA) [4]and another based on a modified Naıve Bayes method [14]We then propose a new approach based on multi-attributeassociation rule mining Finally we consider an ensembleapproach that incorporates all of the different techniques intothe site-level inference engine Construction of the inferenceengine is done offline and infrequently for a particular sitetherefore the cost of generating inference models or rules isnot significantly burdensome to the adversary
To clarify the different methods we will use a toy examplebased on user data presented in Figure 2 In this example Dcontains four attributes id gender relationship and industryThe adversary is interested in determining User 6rsquos industryattribute-value In this scenario User 6 has decided to not makethis attribute-value public Using the site API the adversaryobtains D0 a subset of D containing the public profiles forUsers 1-5 The adversary will now generate his inferenceengine using these public profiles The remainder of thissubsection describes each of the methods that can serve as thebasis for the inference engine that the adversary will build
LDA Nearest Neighbor Inference Chaabane et al usethe Latent Dirichlet Allocation (LDA) generative model toextract semantic links between interest attribute-values [3] Ourvariation of their method is as follows
Each profile idk vk1 vk2 vkq in the subpopulation D0
consists of the attribute-value pairs for some subset of at-tributes in D We begin by considering a particular attributeAq Each attribute has a domain containing a set of values|Aq| = v1 vm In LDA terms we consider eachattribute-value a word For each attribute-value vk we obtainits related Wikipedia categories to enhance the value sets Wefirst retrieve the top relevant article describing the attributevk from a free text index built by the Lemur Search Engine2
over the entire Wikipedia stub contained in the ClueWeb09collection3 The indexrsquos size is approximately 1GB for thecompressed documents Next from each of these articleswe use Wikipedia as an ontology and obtain all the cate-gories and general categories for the top n articles using atoolkit developed by [10] For instance a value ldquosomeonelike yourdquo has top Wikipedia categories ldquoAdele (singer) songsrdquoand ldquoSingles certified septuple platinum by the AustralianRecording Industry Associationrdquo These categories help tocreate the hidden topical structure that will be inferred usingthe observed attribute-values Intuitively the distribution of
2httpwwwlemurprojectorg3httplemurprojectorgclueweb09
ab rarr cacd rarr ead rarr b
bcdf rarr a
Public Profiles
Step 1Subpopulation
Sampling
Step 2Inference Engine
Construction
UserProfile
InferenceModel
HiddenAttribute-Values
Step 3Determination of Hidden
Attribute-Values
InferenceEngine
InferenceModel
Fig 1 Attribute inference methodology The adversary samples arandom subpopulation of site-level public profiles (left) constructssite-level inference rules and models using the sampled public profiles(center) and applies the inference engine to a targeted userrsquos publicprofile to predict a hidden attribute-value (right)
the value of attribute Aj when restricted(i Aj) = ) Theadversaryrsquos goal is to discover information about i which isnot in the public profile P i
To more formally define the attacker we introduce a truthfunction truth(i) that returns a set id vi1 vi2 vim suchthat (1) id = Pi
ID and (2) either vij = Pij if Pi
j = orotherwise vij is the correct value of attribute Aj for user iIntuitively truth(i) is the complete set of correct values foruser i for attributes ID A1 Am and can contain valuesthat are not in D Hence the adversaryrsquos goal is to infer theset truth(i) P i Note that this includes both attributes thatare restricted using the sitersquos permission system as well asattributes that are unknown (ie null) to the site Toward thatend in this paper we attempt to infer single attribute-valuesusing data available to the adversary
IV ATTRIBUTE INFERENCE METHODOLOGY
Even though social networking sites often include privacysettings that allow a user to control which attributes in her pro-file are disclosed to the public based on the previous literaturepresented in Section II we make the observation that removingsensitive attributes from a public profile is insufficient to ensurethat those attributes are not easily discoverable In this paperwe are interested in understanding how a public attribute orpublic attribute combination can be used to infer hidden valuesTherefore we analyze how effective frequent patterns of asitersquos subpopulation are for inferring sensitive attributes thatare hidden by a particular user
We develop an attribute inference methodology for deter-mining non-published attributes about a targeted user Ourmethodology is based on the premise that an attacker mayexplore the site in question and then use this backgroundknowledge to make inferences about a particular userrsquos non-published attributes Figure 1 shows the three steps in our highlevel attribute inference methodology subpopulation samplinginference engine construction and determination of hiddenattribute-values
A Subpopulation Sampling
The first step of our methodology is to randomly samplea subpopulation of profiles from a site containing a databaseD More formally our subpopulation D0 has a representativesample of the attribute-value pairs of interest from D (ieD0 D) In practice an adversary can trivially obtain asubpopulation sample by using a sitersquos web interface or API
B Site-based Inference Engine Construction and Determina-tion of Hidden Attribute-Values
There are many methods for building a site-based inferenceengine We begin by extending two previously proposed ap-proaches one that uses Latent Dirichlet Allocation (LDA) [4]and another based on a modified Naıve Bayes method [14]We then propose a new approach based on multi-attributeassociation rule mining Finally we consider an ensembleapproach that incorporates all of the different techniques intothe site-level inference engine Construction of the inferenceengine is done offline and infrequently for a particular sitetherefore the cost of generating inference models or rules isnot significantly burdensome to the adversary
To clarify the different methods we will use a toy examplebased on user data presented in Figure 2 In this example Dcontains four attributes id gender relationship and industryThe adversary is interested in determining User 6rsquos industryattribute-value In this scenario User 6 has decided to not makethis attribute-value public Using the site API the adversaryobtains D0 a subset of D containing the public profiles forUsers 1-5 The adversary will now generate his inferenceengine using these public profiles The remainder of thissubsection describes each of the methods that can serve as thebasis for the inference engine that the adversary will build
LDA Nearest Neighbor Inference Chaabane et al usethe Latent Dirichlet Allocation (LDA) generative model toextract semantic links between interest attribute-values [3] Ourvariation of their method is as follows
Each profile idk vk1 vk2 vkq in the subpopulation D0
consists of the attribute-value pairs for some subset of at-tributes in D We begin by considering a particular attributeAq Each attribute has a domain containing a set of values|Aq| = v1 vm In LDA terms we consider eachattribute-value a word For each attribute-value vk we obtainits related Wikipedia categories to enhance the value sets Wefirst retrieve the top relevant article describing the attributevk from a free text index built by the Lemur Search Engine2
over the entire Wikipedia stub contained in the ClueWeb09collection3 The indexrsquos size is approximately 1GB for thecompressed documents Next from each of these articleswe use Wikipedia as an ontology and obtain all the cate-gories and general categories for the top n articles using atoolkit developed by [10] For instance a value ldquosomeonelike yourdquo has top Wikipedia categories ldquoAdele (singer) songsrdquoand ldquoSingles certified septuple platinum by the AustralianRecording Industry Associationrdquo These categories help tocreate the hidden topical structure that will be inferred usingthe observed attribute-values Intuitively the distribution of
2httpwwwlemurprojectorg3httplemurprojectorgclueweb09
ab rarr cacd rarr ead rarr b
bcdf rarr a
Public Profiles
Step 1Subpopulation
Sampling
Step 2Inference Engine
Construction
UserProfile
InferenceModel
HiddenAttribute-Values
Step 3Determination of Hidden
Attribute-Values
InferenceEngine
InferenceModel
Fig 1 Attribute inference methodology The adversary samples arandom subpopulation of site-level public profiles (left) constructssite-level inference rules and models using the sampled public profiles(center) and applies the inference engine to a targeted userrsquos publicprofile to predict a hidden attribute-value (right)
the value of attribute Aj when restricted(i Aj) = ) Theadversaryrsquos goal is to discover information about i which isnot in the public profile P i
To more formally define the attacker we introduce a truthfunction truth(i) that returns a set id vi1 vi2 vim suchthat (1) id = Pi
ID and (2) either vij = Pij if Pi
j = orotherwise vij is the correct value of attribute Aj for user iIntuitively truth(i) is the complete set of correct values foruser i for attributes ID A1 Am and can contain valuesthat are not in D Hence the adversaryrsquos goal is to infer theset truth(i) P i Note that this includes both attributes thatare restricted using the sitersquos permission system as well asattributes that are unknown (ie null) to the site Toward thatend in this paper we attempt to infer single attribute-valuesusing data available to the adversary
IV ATTRIBUTE INFERENCE METHODOLOGY
Even though social networking sites often include privacysettings that allow a user to control which attributes in her pro-file are disclosed to the public based on the previous literaturepresented in Section II we make the observation that removingsensitive attributes from a public profile is insufficient to ensurethat those attributes are not easily discoverable In this paperwe are interested in understanding how a public attribute orpublic attribute combination can be used to infer hidden valuesTherefore we analyze how effective frequent patterns of asitersquos subpopulation are for inferring sensitive attributes thatare hidden by a particular user
We develop an attribute inference methodology for deter-mining non-published attributes about a targeted user Ourmethodology is based on the premise that an attacker mayexplore the site in question and then use this backgroundknowledge to make inferences about a particular userrsquos non-published attributes Figure 1 shows the three steps in our highlevel attribute inference methodology subpopulation samplinginference engine construction and determination of hiddenattribute-values
A Subpopulation Sampling
The first step of our methodology is to randomly samplea subpopulation of profiles from a site containing a databaseD More formally our subpopulation D0 has a representativesample of the attribute-value pairs of interest from D (ieD0 D) In practice an adversary can trivially obtain asubpopulation sample by using a sitersquos web interface or API
B Site-based Inference Engine Construction and Determina-tion of Hidden Attribute-Values
There are many methods for building a site-based inferenceengine We begin by extending two previously proposed ap-proaches one that uses Latent Dirichlet Allocation (LDA) [4]and another based on a modified Naıve Bayes method [14]We then propose a new approach based on multi-attributeassociation rule mining Finally we consider an ensembleapproach that incorporates all of the different techniques intothe site-level inference engine Construction of the inferenceengine is done offline and infrequently for a particular sitetherefore the cost of generating inference models or rules isnot significantly burdensome to the adversary
To clarify the different methods we will use a toy examplebased on user data presented in Figure 2 In this example Dcontains four attributes id gender relationship and industryThe adversary is interested in determining User 6rsquos industryattribute-value In this scenario User 6 has decided to not makethis attribute-value public Using the site API the adversaryobtains D0 a subset of D containing the public profiles forUsers 1-5 The adversary will now generate his inferenceengine using these public profiles The remainder of thissubsection describes each of the methods that can serve as thebasis for the inference engine that the adversary will build
LDA Nearest Neighbor Inference. Chaabane et al. use the Latent Dirichlet Allocation (LDA) generative model to extract semantic links between interest attribute-values [3]. Our variation of their method is as follows.
Each profile {id_k, v_k1, v_k2, ..., v_kq} in the subpopulation D0 consists of the attribute-value pairs for some subset of attributes in D. We begin by considering a particular attribute A_q. Each attribute has a domain containing a set of values, |A_q| = {v_1, ..., v_m}. In LDA terms, we consider each attribute-value a word. For each attribute-value v_k, we obtain its related Wikipedia categories to enhance the value sets. We first retrieve the top relevant article describing the attribute v_k from a free text index built by the Lemur Search Engine (footnote 2) over the entire Wikipedia stub contained in the ClueWeb09 collection (footnote 3). The index's size is approximately 1GB for the compressed documents. Next, from each of these articles, we use Wikipedia as an ontology and obtain all the categories and general categories for the top n articles using a toolkit developed by [10]. For instance, a value "someone like you" has top Wikipedia categories "Adele (singer) songs" and "Singles certified septuple platinum by the Australian Recording Industry Association". These categories help to create the hidden topical structure that will be inferred using the observed attribute-values. Intuitively the distribution of
Footnote 2: http://www.lemurproject.org
Footnote 3: http://lemurproject.org/clueweb09
• Use user profiles to construct an inference engine containing a set of inference rules
• Make inferences using the inference engine
What can be inferred from the population?
[Moore et al. 2013]
LinkedIn dataset: 91,150 public profiles, 12 attributes per profile
[Figure: chart of inference gain, axis ticks from 0 to 15]
Web Footprinting
Experiments for Understanding Public Profiles
About.me: a personal website hosting site. Each user can make a custom webpage about themselves and can list links to their social media profiles on multiple websites.
Using their API, we collected 124,497 people's information -> Ground Truth
Creating Web Footprints Using Google+, Foursquare, and LinkedIn Profiles
[Singh et al. 2015]
Synonyms can be found
DBpedia
Synonyms
Meronym
Using an Ontology
Approximately 8,000 attributes were matched up from the ontology
Taking Control of Our Web Identity and Data
1. Keep your public profile professional
2. Change all your social media account settings that have personal information on them from public to private
3. Choose your friends wisely: add them selectively
4. Join groups related to your professional interests
5. Make it difficult for automated tools to link your accounts, e.g., use different account user names, share different information, etc.
6. Install ad blockers to reduce data about your click-through habits
7. Set your browser to not accept cookies from sites that you have not visited before
The world around us
DATAFICATION
Data Ethics
• Regulation
  – We need to hold companies to higher standards
• Data ethics standards
  – We need discussion, debate, and possibly a new discipline
• Catalog of personal data
  – Individuals should be able to see, correct, and/or remove data companies have about them
[Singh 2016]
Final Thoughts
• There is a cultural acceptance of sharing private data publicly
• This is a problem. I have shown you different techniques for generating web footprints. It is too easy.
• We need new ways to help users understand what data can be determined about them and help them take control of their information
• We need to pause and debate online privacy and ethical uses of large-scale human behavioral data
• We need to develop guidelines and regulations that protect users
We need to take back control of our data
References
J. Zhu, S. Zhang, L. Singh, H. Yang, and M. Sherr. Generating Risk Reduction Recommendations to Decrease Identifiability of Public Online Profiles. Under submission.
A. Hian-Cheong, L. Singh, M. Sherr, and H. Yang. Semantics and Public Information Exposure Detection. Invited.
L. Singh, H. Yang, M. Sherr, A. Hian-Cheong, K. Tian, J. Zhu, and S. Zhang. Public Information Exposure Detection: Helping Users Understand Their Web Footprints. International Conference on Advances in Social Networks Analysis and Mining (ASONAM), Paris, France. IEEE/ACM, 2015.
L. Singh, H. Yang, M. Sherr, Y. Wei, A. Hian-Cheong, K. Tian, J. Zhu, S. Zhang, T. Vaidya, and E. Asgarli. Helping Users Understand Their Web Footprints. International Conference on World Wide Web (WWW), Companion Proceedings, Florence, Italy. Poster paper, 2015.
W. B. Moore, Y. Wei, A. Orshefsky, M. Sherr, L. Singh, and H. Yang. Understanding Site-Based Inference Potential for Identifying Hidden Attributes. International Conference on Privacy, Security, Risk and Trust, Alexandria, VA. IEEE Computer Society, 2013.
J. Ferro, L. Singh, and M. Sherr. Identifying individual vulnerability based on public data. International Conference on Privacy, Security and Trust, Tarragona, Catalonia, Spain. IEEE Computer Society, 2013.
F. Nagle, L. Singh, and A. Gkoulalas-Divanis. EWNI: Efficient Anonymization of Vulnerable Individuals in Social Networks. Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), Kuala Lumpur, Malaysia. Springer, 2012.
A. Ramachandran, L. Singh, E. Porter, and F. Nagle. Exploring re-identification risks in public domains. Conference on Privacy, Security and Trust (PST). IEEE Computer Society, 2012.
The Team & Support
• Faculty – Lisa Singh, Micah Sherr, Grace Hui Yang
• Students & Researchers – Rob Churchill, Kristen Skillman, Kevin Tian, Sicong Zhang, Yanan Zhu
• Alumni – Aditi Ramachandran, Frank Nagle, John Ferro, Yifang Wei, Brad Moore, Andrew Hian-Cheong, Janet Zhu
Support: National Science Foundation
5 Reasons to Join Our Program
1. We are research active and provide full financial RA support for PhD students for 5 years
2. We have full and partial scholarships for Master's students
3. Our courses span applied and theoretical areas of computer science as well as interdisciplinary areas like data science
4. We have exceptional job placement in top tech firms, national labs, and government agencies
5. We have a strong community among students and faculty
Need more information?
Website: http://cs.georgetown.edu
Graduate Director: Lisa Singh (singh@cs.georgetown.edu)
Application deadline: April 1
Web Footprint
[Figure: three separate John Doe profiles holding attributes A1-A2, A3-A4, and A5-A6 are linked into one web footprint containing A1 through A6]
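The web footprint idea, in which attributes A1 through A6 from separate John Doe profiles are combined into one record, can be sketched as a simple merge. This is an illustrative sketch only: the site names, attribute values, and the function `build_web_footprint` are hypothetical, and it assumes the three profiles have already been matched to the same person, which is the hard part in practice.

```python
def build_web_footprint(site_profiles):
    """Merge per-site attributes into one record, remembering which site exposed each value."""
    footprint = {}
    for site, attributes in site_profiles.items():
        for name, value in attributes.items():
            # Keep every site's value so conflicting entries remain visible.
            footprint.setdefault(name, {})[site] = value
    return footprint

# Hypothetical profiles assumed to belong to the same person.
site_profiles = {
    "site_1": {"name": "John Doe", "A1": "value 1", "A2": "value 2"},
    "site_2": {"name": "John Doe", "A3": "value 3", "A4": "value 4"},
    "site_3": {"name": "John Doe", "A5": "value 5", "A6": "value 6"},
}

footprint = build_web_footprint(site_profiles)
print(sorted(footprint))
```

No single site in this example exposes more than two of the six attributes, yet the merged footprint exposes all of them.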
Really linking data
Shared Public Attributes
Google+
• Company • Occupation • Education • Location • Birthdate • Relationship status
• Gender • Graduation year
• Company • Location • Education • Email • Occupation • Skills • Industry • Website • Languages
Foursquare
• Facebook id • Twitter handle
• Email • Gender • Location • Phone number
• Relationship status
What do group memberships tell us?
What about tweets?
• A special wish for a special girl HappyBirthday
• I love Starbuck MangoTeaLemonade
• Go Bears
[Singh et al. 2015]
• Birthday • Gender • Address • Education • Hobbies
• Skills • Title • Industry • Education • Experience
• Thoughts • Ideas • Interests • Hobbies
To what degree can site-level data be leveraged to determine the undisclosed attributes of a user?
What about the population?
Methodology
• Sample user profiles from media sites
[Figure 1 panels: Step 1, Subpopulation Sampling (public profiles); Step 2, Inference Engine Construction (inference models and example rules ab → c, acd → e, ad → b, bcdf → a); Step 3, Determination of Hidden Attribute-Values (user profile, inference engine, hidden attribute-values)]
Fig. 1. Attribute inference methodology. The adversary samples a random subpopulation of site-level public profiles (left), constructs site-level inference rules and models using the sampled public profiles (center), and applies the inference engine to a targeted user's public profile to predict a hidden attribute-value (right).
the value of attribute A_j when restricted(i, A_j) = ). The adversary's goal is to discover information about i which is not in the public profile P^i.
To more formally define the attacker, we introduce a truth function truth(i) that returns a set {id, v_i1, v_i2, ..., v_im} such that (1) id = P^i_ID, and (2) either v_ij = P^i_j if P^i_j is present (not null), or otherwise v_ij is the correct value of attribute A_j for user i. Intuitively, truth(i) is the complete set of correct values for user i for attributes ID, A_1, ..., A_m and can contain values that are not in D. Hence, the adversary's goal is to infer the set truth(i) \ P^i. Note that this includes both attributes that are restricted using the site's permission system as well as attributes that are unknown (i.e., null) to the site. Toward that end, in this paper, we attempt to infer single attribute-values using data available to the adversary.
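The truth function and the adversary's goal can be sketched as follows. This is a minimal illustration that assumes a hidden or unknown value is represented by `None` and reads the goal as the part of truth(i) that is absent from the public profile; the names `truth`, `adversary_goal`, and the example values are hypothetical.

```python
HIDDEN = None  # stand-in for a value that is restricted or unknown to the site

def truth(public_profile, correct_values):
    """Complete profile: the public value where one is visible, otherwise the correct value."""
    return {
        attr: public_profile[attr] if public_profile.get(attr) is not HIDDEN else correct_values[attr]
        for attr in correct_values
    }

def adversary_goal(public_profile, correct_values):
    """Attribute-values in truth(i) that the public profile does not reveal."""
    full = truth(public_profile, correct_values)
    return {attr: value for attr, value in full.items() if public_profile.get(attr) is HIDDEN}

# Hypothetical user: industry is hidden from the public profile.
public_profile = {"id": 6, "gender": "F", "relationship": "single", "industry": HIDDEN}
correct_values = {"id": 6, "gender": "F", "relationship": "single", "industry": "education"}

print(adversary_goal(public_profile, correct_values))
```

The `correct_values` dictionary is available only to this illustration. The adversary never sees it and must estimate the hidden entries from the sampled subpopulation.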
IV. ATTRIBUTE INFERENCE METHODOLOGY
Even though social networking sites often include privacy settings that allow a user to control which attributes in her profile are disclosed to the public, based on the previous literature presented in Section II, we make the observation that removing sensitive attributes from a public profile is insufficient to ensure that those attributes are not easily discoverable. In this paper, we are interested in understanding how a public attribute or public attribute combination can be used to infer hidden values. Therefore, we analyze how effective frequent patterns of a site's subpopulation are for inferring sensitive attributes that are hidden by a particular user.
the value of attribute Aj when restricted(i Aj) = ) Theadversaryrsquos goal is to discover information about i which isnot in the public profile P i
To more formally define the attacker we introduce a truthfunction truth(i) that returns a set id vi1 vi2 vim suchthat (1) id = Pi
ID and (2) either vij = Pij if Pi
j = orotherwise vij is the correct value of attribute Aj for user iIntuitively truth(i) is the complete set of correct values foruser i for attributes ID A1 Am and can contain valuesthat are not in D Hence the adversaryrsquos goal is to infer theset truth(i) P i Note that this includes both attributes thatare restricted using the sitersquos permission system as well asattributes that are unknown (ie null) to the site Toward thatend in this paper we attempt to infer single attribute-valuesusing data available to the adversary
IV. ATTRIBUTE INFERENCE METHODOLOGY
Even though social networking sites often include privacy settings that allow a user to control which attributes in her profile are disclosed to the public, based on the previous literature presented in Section II, we make the observation that removing sensitive attributes from a public profile is insufficient to ensure that those attributes are not easily discoverable. In this paper, we are interested in understanding how a public attribute or public attribute combination can be used to infer hidden values. Therefore, we analyze how effective frequent patterns of a site's subpopulation are for inferring sensitive attributes that are hidden by a particular user.
We develop an attribute inference methodology for determining non-published attributes about a targeted user. Our methodology is based on the premise that an attacker may explore the site in question and then use this background knowledge to make inferences about a particular user's non-published attributes. Figure 1 shows the three steps in our high-level attribute inference methodology: subpopulation sampling, inference engine construction, and determination of hidden attribute-values.
A. Subpopulation Sampling
The first step of our methodology is to randomly sample a subpopulation of profiles from a site containing a database $D$. More formally, our subpopulation $D'$ has a representative sample of the attribute-value pairs of interest from $D$ (i.e., $D' \subseteq D$). In practice, an adversary can trivially obtain a subpopulation sample by using a site's web interface or API.
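The sampling step can be sketched as uniform random sampling without replacement. This is a minimal sketch that assumes the profiles have already been collected into a list; the function name `sample_subpopulation`, the profile fields, and the fixed seed are hypothetical choices for the example.

```python
import random
from typing import Dict, List


def sample_subpopulation(profiles: List[Dict[str, str]], k: int, seed: int = 0) -> List[Dict[str, str]]:
    """Draw up to k profiles uniformly at random, without replacement."""
    rng = random.Random(seed)  # fixed seed so the example is repeatable
    k = min(k, len(profiles))
    return rng.sample(profiles, k)


# Hypothetical database D of ten public profiles.
D = [{"id": str(n), "gender": "F" if n % 2 else "M"} for n in range(1, 11)]
D_prime = sample_subpopulation(D, k=4)

print(len(D_prime))
```

Every sampled profile comes from `D`, so the result is a subset of the database, matching the relationship between the subpopulation and D described above.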
B. Site-based Inference Engine Construction and Determination of Hidden Attribute-Values
There are many methods for building a site-based inference engine. We begin by extending two previously proposed approaches: one that uses Latent Dirichlet Allocation (LDA) [4] and another based on a modified Naïve Bayes method [14]. We then propose a new approach based on multi-attribute association rule mining. Finally, we consider an ensemble approach that incorporates all of the different techniques into the site-level inference engine. Construction of the inference engine is done offline and infrequently for a particular site; therefore, the cost of generating inference models or rules is not significantly burdensome to the adversary.
To clarify the different methods, we will use a toy example based on user data presented in Figure 2. In this example, $D$ contains four attributes: id, gender, relationship, and industry. The adversary is interested in determining User 6's industry attribute-value. In this scenario, User 6 has decided to not make this attribute-value public. Using the site API, the adversary obtains $D'$, a subset of $D$ containing the public profiles for Users 1-5. The adversary will now generate his inference engine using these public profiles. The remainder of this subsection describes each of the methods that can serve as the basis for the inference engine that the adversary will build.
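To make the toy scenario concrete, here is a small Python sketch of rule-based inference over five public profiles. It is a brute-force illustration under stated assumptions, not the multi-attribute association rule mining procedure of the paper: the profile values are invented (Figure 2 is not reproduced here), and the names `mine_rules` and `infer` and the support and confidence thresholds are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical public profiles for Users 1-5 (illustrative values only).
D_prime = [
    {"id": 1, "gender": "F", "relationship": "single", "industry": "education"},
    {"id": 2, "gender": "F", "relationship": "married", "industry": "health"},
    {"id": 3, "gender": "M", "relationship": "married", "industry": "software"},
    {"id": 4, "gender": "F", "relationship": "married", "industry": "health"},
    {"id": 5, "gender": "M", "relationship": "single", "industry": "software"},
]


def mine_rules(profiles, target, predictors, min_support=2, min_confidence=0.6):
    """Mine rules {attr=value, ...} -> target=value by exhaustive counting."""
    antecedent_counts = Counter()
    rule_counts = Counter()
    for p in profiles:
        items = [(a, p[a]) for a in predictors if p.get(a) is not None]
        for size in range(1, len(items) + 1):
            for antecedent in combinations(items, size):
                antecedent_counts[antecedent] += 1
                if p.get(target) is not None:
                    rule_counts[(antecedent, p[target])] += 1
    rules = []
    for (antecedent, value), count in rule_counts.items():
        confidence = count / antecedent_counts[antecedent]
        if count >= min_support and confidence >= min_confidence:
            rules.append((antecedent, value, count, confidence))
    return rules


def infer(rules, public_profile):
    """Apply the matching rule with the highest confidence (ties broken by support)."""
    matching = [r for r in rules if all(public_profile.get(a) == v for a, v in r[0])]
    if not matching:
        return None
    return max(matching, key=lambda r: (r[3], r[2]))[1]


rules = mine_rules(D_prime, target="industry", predictors=["gender", "relationship"])
user6 = {"id": 6, "gender": "F", "relationship": "married"}  # industry is hidden
print(infer(rules, user6))
```

With these invented profiles, both women who list "married" work in health, so the two-attribute rule is the most confident match for User 6. Exhaustive enumeration is only practical for a handful of attributes; a real rule miner would prune infrequent attribute combinations.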
LDA Nearest Neighbor Inference. Chaabane et al. use the Latent Dirichlet Allocation (LDA) generative model to extract semantic links between interest attribute-values [3]. Our variation of their method is as follows.
Each profile $\{id_k, v_{k1}, v_{k2}, \ldots, v_{kq}\}$ in the subpopulation $D'$ consists of the attribute-value pairs for some subset of attributes in $D$. We begin by considering a particular attribute $A_q$. Each attribute has a domain containing a set of values $|A_q| = \{v_1, \ldots, v_m\}$. In LDA terms, we consider each attribute-value a word. For each attribute-value $v_k$, we obtain its related Wikipedia categories to enhance the value sets. We first retrieve the top relevant article describing the attribute $v_k$ from a free text index built by the Lemur Search Engine² over the entire Wikipedia stub contained in the ClueWeb09 collection³. The index's size is approximately 1GB for the compressed documents. Next, from each of these articles, we use Wikipedia as an ontology and obtain all the categories and general categories for the top $n$ articles using a toolkit developed by [10]. For instance, a value "someone like you" has top Wikipedia categories "Adele (singer) songs" and "Singles certified septuple platinum by the Australian Recording Industry Association". These categories help to create the hidden topical structure that will be inferred using the observed attribute-values. Intuitively, the distribution of
² http://www.lemurproject.org
³ http://lemurproject.org/clueweb09
• Make inferences using the inference engine
• Use user profiles to construct an inference engine containing a set of inference rules
[Chart: inference gain, axis ticks from 0 to 15]
What can be inferred from the population
[Moore et al. 2013]
LinkedIn dataset: 91,150 public profiles, 12 attributes per profile
Web Footprinting
Experiments for Understanding Public Profiles
About.me - a personal website hosting site. Each user can make a custom webpage about themselves and can list links to their social media profiles on multiple websites.
Using their API, we collected 124,497 people's information -> ground truth
Creating Web Footprints Using Google+ Foursquare LinkedIn Profiles
[Singh et al. 2015]
Synonyms can be found
DBpedia
Synonyms
Meronym
Using an Ontology
Approximately 8,000 attributes were matched up from the ontology
Taking Control of Our Web Identity and Data
1. Keep your public profile professional
2. Change all your social media account settings that have personal information on them from public to private
3. Choose your friends wisely – add them selectively
4. Join groups related to your professional interests
5. Make it difficult for automated tools to link your accounts, e.g., use different account user names, share different information, etc.
6. Install ad blockers to reduce data about your click-through habits
7. Set your browser to not accept cookies from sites that you have not visited before
The world around us
DATAFICATION
Data Ethics
• Regulation
  – We need to hold companies to higher standards
• Data ethics standards
  – We need discussion, debate, and possibly a new discipline
• Catalog of personal data
  – Individuals should be able to see, correct, and/or remove data companies have about them
[Singh 2016]
Final Thoughts
• There is a cultural acceptance of sharing private data publicly
• This is a problem - I have shown you different techniques for generating web footprints. It is too easy.
• We need new ways to help users understand what data can be determined about them and help them take control of their information
• We need to pause and debate online privacy and ethical uses of large-scale human behavioral data
• We need to develop guidelines and regulations that protect users
We need to take back control of our data
References
J. Zhu, S. Zhang, L. Singh, H. Yang, and M. Sherr. Generating Risk Reduction Recommendations to Decrease Identifiability of Public Online Profiles. Under submission.
A. Hian-Cheong, L. Singh, M. Sherr, and H. Yang. Semantics and Public Information Exposure Detection. Invited.
L. Singh, H. Yang, M. Sherr, A. Hian-Cheong, K. Tian, J. Zhu, and S. Zhang. Public Information Exposure Detection: Helping Users Understand Their Web Footprints. International Conference on Advances in Social Networks Analysis and Mining (ASONAM), Paris, France. IEEE/ACM, 2015.
L. Singh, H. Yang, M. Sherr, Y. Wei, A. Hian-Cheong, K. Tian, J. Zhu, S. Zhang, T. Vaidya, and E. Asgarli. Helping Users Understand Their Web Footprints. International Conference on World Wide Web - Companion Proceedings, World Wide Web (WWW), Florence, Italy. Poster paper, 2015.
W. B. Moore, Y. Wei, A. Orshefsky, M. Sherr, L. Singh, and H. Yang. Understanding Site-Based Inference Potential for Identifying Hidden Attributes. International Conference on Privacy, Security, Risk and Trust, Alexandria, VA. IEEE Computer Society, 2013.
J. Ferro, L. Singh, and M. Sherr. Identifying individual vulnerability based on public data. International Conference on Privacy, Security and Trust, Tarragona, Catalonia, Spain. IEEE Computer Society, 2013.
F. Nagle, L. Singh, and A. Gkoulalas-Divanis. EWNI: Efficient Anonymization of Vulnerable Individuals in Social Networks. Pacific Asian Conference on Knowledge Discovery and Data Mining (PAKDD), Kuala Lumpur, Malaysia. Springer, 2012.
A. Ramachandran, L. Singh, E. Porter, and F. Nagle. Exploring re-identification risks in public domains. Conference on Privacy, Security and Trust (PST). IEEE Computer Society, 2012.
The Team & Support
• Faculty – Lisa Singh, Micah Sherr, Grace Hui Yang
• Students & Researchers – Rob Churchill, Kristen Skillman, Kevin Tian, Sicong Zhang, Yanan Zhu
• Alumni – Aditi Ramachandran, Frank Nagle, John Ferro, Yifang Wei, Brad Moore, Andrew Hian-Cheong, Janet Zhu
Support: National Science Foundation
5 Reasons to Join Our Program
1. We are research active and provide full financial RA support for PhD students for 5 years
2. We have full and partial scholarships for Master's students
3. Our courses span applied and theoretical areas of computer science as well as interdisciplinary areas like data science
4. We have exceptional job placement in top tech firms, national labs, and government agencies
5. We have a strong community among students and faculty
Need more information? Website: http://cs.georgetown.edu
Graduate Director: Lisa Singh (singh@cs.georgetown.edu)
Application deadline: April 1
What do group memberships tell us?
What about tweets?
• A special wish for a special girl HappyBirthday
• I love Starbuck MangoTeaLemonade
• Go Bears
[Singh et al. 2015]
• Birthday
• Gender
• Address
• Education
• Hobbies

• Skills
• Title
• Industry
• Education
• Experience

• Thoughts
• Ideas
• Interests
• Hobbies
To what degree can site-level data be leveraged to determine the undisclosed attributes of a user?
What about the population?
Methodology
• Sample user profiles from media sites
ab rarr cacd rarr ead rarr b
bcdf rarr a
Public Profiles
Step 1Subpopulation
Sampling
Step 2Inference Engine
Construction
UserProfile
InferenceModel
HiddenAttribute-Values
Step 3Determination of Hidden
Attribute-Values
InferenceEngine
InferenceModel
Fig 1 Attribute inference methodology The adversary samples arandom subpopulation of site-level public profiles (left) constructssite-level inference rules and models using the sampled public profiles(center) and applies the inference engine to a targeted userrsquos publicprofile to predict a hidden attribute-value (right)
the value of attribute Aj when restricted(i Aj) = ) Theadversaryrsquos goal is to discover information about i which isnot in the public profile P i
To more formally define the attacker we introduce a truthfunction truth(i) that returns a set id vi1 vi2 vim suchthat (1) id = Pi
ID and (2) either vij = Pij if Pi
j = orotherwise vij is the correct value of attribute Aj for user iIntuitively truth(i) is the complete set of correct values foruser i for attributes ID A1 Am and can contain valuesthat are not in D Hence the adversaryrsquos goal is to infer theset truth(i) P i Note that this includes both attributes thatare restricted using the sitersquos permission system as well asattributes that are unknown (ie null) to the site Toward thatend in this paper we attempt to infer single attribute-valuesusing data available to the adversary
IV ATTRIBUTE INFERENCE METHODOLOGY
Even though social networking sites often include privacysettings that allow a user to control which attributes in her pro-file are disclosed to the public based on the previous literaturepresented in Section II we make the observation that removingsensitive attributes from a public profile is insufficient to ensurethat those attributes are not easily discoverable In this paperwe are interested in understanding how a public attribute orpublic attribute combination can be used to infer hidden valuesTherefore we analyze how effective frequent patterns of asitersquos subpopulation are for inferring sensitive attributes thatare hidden by a particular user
We develop an attribute inference methodology for deter-mining non-published attributes about a targeted user Ourmethodology is based on the premise that an attacker mayexplore the site in question and then use this backgroundknowledge to make inferences about a particular userrsquos non-published attributes Figure 1 shows the three steps in our highlevel attribute inference methodology subpopulation samplinginference engine construction and determination of hiddenattribute-values
A Subpopulation Sampling
The first step of our methodology is to randomly samplea subpopulation of profiles from a site containing a databaseD More formally our subpopulation D0 has a representativesample of the attribute-value pairs of interest from D (ieD0 D) In practice an adversary can trivially obtain asubpopulation sample by using a sitersquos web interface or API
B Site-based Inference Engine Construction and Determina-tion of Hidden Attribute-Values
There are many methods for building a site-based inferenceengine We begin by extending two previously proposed ap-proaches one that uses Latent Dirichlet Allocation (LDA) [4]and another based on a modified Naıve Bayes method [14]We then propose a new approach based on multi-attributeassociation rule mining Finally we consider an ensembleapproach that incorporates all of the different techniques intothe site-level inference engine Construction of the inferenceengine is done offline and infrequently for a particular sitetherefore the cost of generating inference models or rules isnot significantly burdensome to the adversary
To clarify the different methods we will use a toy examplebased on user data presented in Figure 2 In this example Dcontains four attributes id gender relationship and industryThe adversary is interested in determining User 6rsquos industryattribute-value In this scenario User 6 has decided to not makethis attribute-value public Using the site API the adversaryobtains D0 a subset of D containing the public profiles forUsers 1-5 The adversary will now generate his inferenceengine using these public profiles The remainder of thissubsection describes each of the methods that can serve as thebasis for the inference engine that the adversary will build
LDA Nearest Neighbor Inference Chaabane et al usethe Latent Dirichlet Allocation (LDA) generative model toextract semantic links between interest attribute-values [3] Ourvariation of their method is as follows
Each profile idk vk1 vk2 vkq in the subpopulation D0
consists of the attribute-value pairs for some subset of at-tributes in D We begin by considering a particular attributeAq Each attribute has a domain containing a set of values|Aq| = v1 vm In LDA terms we consider eachattribute-value a word For each attribute-value vk we obtainits related Wikipedia categories to enhance the value sets Wefirst retrieve the top relevant article describing the attributevk from a free text index built by the Lemur Search Engine2
over the entire Wikipedia stub contained in the ClueWeb09collection3 The indexrsquos size is approximately 1GB for thecompressed documents Next from each of these articleswe use Wikipedia as an ontology and obtain all the cate-gories and general categories for the top n articles using atoolkit developed by [10] For instance a value ldquosomeonelike yourdquo has top Wikipedia categories ldquoAdele (singer) songsrdquoand ldquoSingles certified septuple platinum by the AustralianRecording Industry Associationrdquo These categories help tocreate the hidden topical structure that will be inferred usingthe observed attribute-values Intuitively the distribution of
2httpwwwlemurprojectorg3httplemurprojectorgclueweb09
ab rarr cacd rarr ead rarr b
bcdf rarr a
Public Profiles
Step 1Subpopulation
Sampling
Step 2Inference Engine
Construction
UserProfile
InferenceModel
HiddenAttribute-Values
Step 3Determination of Hidden
Attribute-Values
InferenceEngine
InferenceModel
Fig 1 Attribute inference methodology The adversary samples arandom subpopulation of site-level public profiles (left) constructssite-level inference rules and models using the sampled public profiles(center) and applies the inference engine to a targeted userrsquos publicprofile to predict a hidden attribute-value (right)
the value of attribute Aj when restricted(i Aj) = ) Theadversaryrsquos goal is to discover information about i which isnot in the public profile P i
To more formally define the attacker we introduce a truthfunction truth(i) that returns a set id vi1 vi2 vim suchthat (1) id = Pi
ID and (2) either vij = Pij if Pi
j = orotherwise vij is the correct value of attribute Aj for user iIntuitively truth(i) is the complete set of correct values foruser i for attributes ID A1 Am and can contain valuesthat are not in D Hence the adversaryrsquos goal is to infer theset truth(i) P i Note that this includes both attributes thatare restricted using the sitersquos permission system as well asattributes that are unknown (ie null) to the site Toward thatend in this paper we attempt to infer single attribute-valuesusing data available to the adversary
IV ATTRIBUTE INFERENCE METHODOLOGY
Even though social networking sites often include privacysettings that allow a user to control which attributes in her pro-file are disclosed to the public based on the previous literaturepresented in Section II we make the observation that removingsensitive attributes from a public profile is insufficient to ensurethat those attributes are not easily discoverable In this paperwe are interested in understanding how a public attribute orpublic attribute combination can be used to infer hidden valuesTherefore we analyze how effective frequent patterns of asitersquos subpopulation are for inferring sensitive attributes thatare hidden by a particular user
We develop an attribute inference methodology for deter-mining non-published attributes about a targeted user Ourmethodology is based on the premise that an attacker mayexplore the site in question and then use this backgroundknowledge to make inferences about a particular userrsquos non-published attributes Figure 1 shows the three steps in our highlevel attribute inference methodology subpopulation samplinginference engine construction and determination of hiddenattribute-values
A Subpopulation Sampling
The first step of our methodology is to randomly samplea subpopulation of profiles from a site containing a databaseD More formally our subpopulation D0 has a representativesample of the attribute-value pairs of interest from D (ieD0 D) In practice an adversary can trivially obtain asubpopulation sample by using a sitersquos web interface or API
B Site-based Inference Engine Construction and Determina-tion of Hidden Attribute-Values
There are many methods for building a site-based inferenceengine We begin by extending two previously proposed ap-proaches one that uses Latent Dirichlet Allocation (LDA) [4]and another based on a modified Naıve Bayes method [14]We then propose a new approach based on multi-attributeassociation rule mining Finally we consider an ensembleapproach that incorporates all of the different techniques intothe site-level inference engine Construction of the inferenceengine is done offline and infrequently for a particular sitetherefore the cost of generating inference models or rules isnot significantly burdensome to the adversary
To clarify the different methods we will use a toy examplebased on user data presented in Figure 2 In this example Dcontains four attributes id gender relationship and industryThe adversary is interested in determining User 6rsquos industryattribute-value In this scenario User 6 has decided to not makethis attribute-value public Using the site API the adversaryobtains D0 a subset of D containing the public profiles forUsers 1-5 The adversary will now generate his inferenceengine using these public profiles The remainder of thissubsection describes each of the methods that can serve as thebasis for the inference engine that the adversary will build
LDA Nearest Neighbor Inference Chaabane et al usethe Latent Dirichlet Allocation (LDA) generative model toextract semantic links between interest attribute-values [3] Ourvariation of their method is as follows
Each profile idk vk1 vk2 vkq in the subpopulation D0
consists of the attribute-value pairs for some subset of at-tributes in D We begin by considering a particular attributeAq Each attribute has a domain containing a set of values|Aq| = v1 vm In LDA terms we consider eachattribute-value a word For each attribute-value vk we obtainits related Wikipedia categories to enhance the value sets Wefirst retrieve the top relevant article describing the attributevk from a free text index built by the Lemur Search Engine2
over the entire Wikipedia stub contained in the ClueWeb09collection3 The indexrsquos size is approximately 1GB for thecompressed documents Next from each of these articleswe use Wikipedia as an ontology and obtain all the cate-gories and general categories for the top n articles using atoolkit developed by [10] For instance a value ldquosomeonelike yourdquo has top Wikipedia categories ldquoAdele (singer) songsrdquoand ldquoSingles certified septuple platinum by the AustralianRecording Industry Associationrdquo These categories help tocreate the hidden topical structure that will be inferred usingthe observed attribute-values Intuitively the distribution of
2httpwwwlemurprojectorg3httplemurprojectorgclueweb09
[Figure 1: three-panel diagram. Step 1: Subpopulation Sampling (public profiles). Step 2: Inference Engine Construction (inference model and inference rules such as ab → c, acd → e, ad → b, bcdf → a). Step 3: Determination of Hidden Attribute-Values (user profile, inference engine, inference model, hidden attribute-values).]
Fig. 1. Attribute inference methodology. The adversary samples a random subpopulation of site-level public profiles (left), constructs site-level inference rules and models using the sampled public profiles (center), and applies the inference engine to a targeted user's public profile to predict a hidden attribute-value (right).
[...] the value of attribute $A_j$ when $\mathrm{restricted}(i, A_j) =$ [...]). The adversary's goal is to discover information about $i$ that is not in the public profile $P^i$.
To define the attacker more formally, we introduce a truth function $\mathrm{truth}(i)$ that returns a set $\{id, v_{i1}, v_{i2}, \ldots, v_{im}\}$ such that (1) $id = P^i_{ID}$ and (2) either $v_{ij} = P^i_j$ if $P^i_j =$ [...], or otherwise $v_{ij}$ is the correct value of attribute $A_j$ for user $i$. Intuitively, $\mathrm{truth}(i)$ is the complete set of correct values for user $i$ for attributes $ID, A_1, \ldots, A_m$, and it can contain values that are not in $D$. Hence, the adversary's goal is to infer the set $\mathrm{truth}(i) \setminus P^i$. Note that this includes both attributes that are restricted using the site's permission system and attributes that are unknown (i.e., null) to the site. Toward that end, in this paper we attempt to infer single attribute-values using data available to the adversary.
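As a rough illustration of the adversary's target set, the sketch below computes the attribute-values that are in truth(i) but absent from the public profile. The dictionary representation, the use of None for a missing public value, and all sample values are assumptions made for this example; they are not taken from the paper.

```python
from typing import Dict, Optional

# Assumption: a profile maps attribute names to values, and None marks a
# value that is missing from the public profile (restricted or unknown).
Profile = Dict[str, Optional[str]]


def hidden_attributes(truth: Dict[str, str], public: Profile) -> Dict[str, str]:
    """Return the attribute-values in truth(i) that the public profile lacks."""
    return {
        attr: value
        for attr, value in truth.items()
        if public.get(attr) is None
    }


# Hypothetical values for a single user.
truth_6 = {"id": "6", "gender": "F", "relationship": "single", "industry": "education"}
public_6 = {"id": "6", "gender": "F", "relationship": "single", "industry": None}

print(hidden_attributes(truth_6, public_6))
```

In this toy case the only hidden attribute-value is the industry, which is the single value the adversary would try to infer.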
IV. ATTRIBUTE INFERENCE METHODOLOGY
Social networking sites often include privacy settings that let a user control which attributes in her profile are disclosed to the public. However, based on the previous literature presented in Section II, we observe that removing sensitive attributes from a public profile is insufficient to ensure that those attributes are not easily discoverable. In this paper, we are interested in understanding how a public attribute or a combination of public attributes can be used to infer hidden values. Therefore, we analyze how effective the frequent patterns of a site's subpopulation are for inferring sensitive attributes that a particular user has hidden.
We develop an attribute inference methodology for determining non-published attributes about a targeted user. Our methodology is based on the premise that an attacker may explore the site in question and then use this background knowledge to make inferences about a particular user's non-published attributes. Figure 1 shows the three steps in our high-level attribute inference methodology: subpopulation sampling, inference engine construction, and determination of hidden attribute-values.
A. Subpopulation Sampling
The first step of our methodology is to randomly sample a subpopulation of profiles from a site containing a database D. More formally, our subpopulation D0 is a representative sample of the attribute-value pairs of interest from D (i.e., $D0 \subseteq D$). In practice, an adversary can trivially obtain a subpopulation sample by using a site's web interface or API.
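The sampling step can be sketched as a uniform random draw without replacement. This is a minimal sketch: the in-memory list `site_profiles` stands in for whatever a site's web interface or API would return, and the function name, seed, and sample size are hypothetical.

```python
import random
from typing import Dict, List


def sample_subpopulation(
    profiles: List[Dict[str, str]], k: int, seed: int = 0
) -> List[Dict[str, str]]:
    """Draw up to k public profiles uniformly at random, without replacement."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    return rng.sample(profiles, min(k, len(profiles)))


# Hypothetical stand-in for the public profiles exposed by a site.
site_profiles = [{"id": str(n), "gender": "F" if n % 2 else "M"} for n in range(1, 101)]

subpopulation = sample_subpopulation(site_profiles, k=10)
print(len(subpopulation))
```

Whether a uniform draw yields a representative sample depends on how the site exposes profiles, so a real crawl would need to check for sampling bias.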
B. Site-based Inference Engine Construction and Determination of Hidden Attribute-Values
There are many methods for building a site-based inference engine. We begin by extending two previously proposed approaches: one that uses Latent Dirichlet Allocation (LDA) [4] and another based on a modified Naïve Bayes method [14]. We then propose a new approach based on multi-attribute association rule mining. Finally, we consider an ensemble approach that incorporates all of the different techniques into the site-level inference engine. Construction of the inference engine is done offline and infrequently for a particular site; therefore, the cost of generating inference models or rules is not significantly burdensome to the adversary.
To clarify the different methods, we will use a toy example based on user data presented in Figure 2. In this example, D contains four attributes: id, gender, relationship, and industry. The adversary is interested in determining User 6's industry attribute-value. In this scenario, User 6 has decided not to make this attribute-value public. Using the site API, the adversary obtains D0, a subset of D containing the public profiles for Users 1-5. The adversary will now generate his inference engine using these public profiles. The remainder of this subsection describes each of the methods that can serve as the basis for the inference engine that the adversary will build.
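To make the toy scenario concrete, the sketch below mines simple rules of the form {public attribute-values} → industry from five public profiles and applies them to a sixth user. It is a brute-force illustration of association-rule style inference, not the paper's algorithm. The profile values are invented because Figure 2 is not reproduced here, and the support and confidence thresholds are arbitrary assumptions.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical stand-in for the Figure 2 data (Users 1-5); values are invented.
public_profiles = [
    {"gender": "F", "relationship": "married", "industry": "education"},
    {"gender": "F", "relationship": "married", "industry": "education"},
    {"gender": "M", "relationship": "single", "industry": "software"},
    {"gender": "M", "relationship": "single", "industry": "software"},
    {"gender": "F", "relationship": "single", "industry": "software"},
]


def mine_rules(profiles, target, min_support=2, min_confidence=0.6):
    """Mine rules {attr=value, ...} -> target value by enumerating antecedents."""
    antecedent_counts = Counter()
    joint_counts = defaultdict(Counter)
    for profile in profiles:
        items = sorted((a, v) for a, v in profile.items() if a != target)
        for size in range(1, len(items) + 1):
            for antecedent in combinations(items, size):
                antecedent_counts[antecedent] += 1
                joint_counts[antecedent][profile[target]] += 1
    rules = {}
    for antecedent, total in antecedent_counts.items():
        value, hits = joint_counts[antecedent].most_common(1)[0]
        confidence = hits / total
        if hits >= min_support and confidence >= min_confidence:
            rules[antecedent] = (value, confidence)
    return rules


def infer(rules, public_profile):
    """Predict with the most confident (then most specific) matching rule."""
    items = set(public_profile.items())
    matches = [
        (confidence, len(antecedent), value)
        for antecedent, (value, confidence) in rules.items()
        if set(antecedent) <= items
    ]
    return max(matches)[2] if matches else None


rules = mine_rules(public_profiles, target="industry")
user_6 = {"gender": "M", "relationship": "single"}  # industry is hidden
print(infer(rules, user_6))
```

Enumerating every antecedent is exponential in the number of attributes, which is acceptable for a handful of attributes; a real engine would more likely use a frequent-pattern miner.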
LDA Nearest Neighbor Inference: Chaabane et al. use the Latent Dirichlet Allocation (LDA) generative model to extract semantic links between interest attribute-values [3]. Our variation of their method is as follows.
Each profile $\{id_k, v_{k1}, v_{k2}, \ldots, v_{kq}\}$ in the subpopulation D0 consists of the attribute-value pairs for some subset of attributes in D. We begin by considering a particular attribute $A_q$. Each attribute has a domain containing a set of values, $|A_q| = \{v_1, \ldots, v_m\}$. In LDA terms, we consider each attribute-value a word. For each attribute-value $v_k$, we obtain its related Wikipedia categories to enhance the value sets. We first retrieve the top relevant article describing the attribute-value $v_k$ from a free-text index built by the Lemur Search Engine (footnote 2) over the entire Wikipedia stub contained in the ClueWeb09 collection (footnote 3). The index's size is approximately 1GB for the compressed documents. Next, using Wikipedia as an ontology, we obtain all the categories and general categories for the top n articles using a toolkit developed by [10]. For instance, the value "someone like you" has top Wikipedia categories "Adele (singer) songs" and "Singles certified septuple platinum by the Australian Recording Industry Association". These categories help to create the hidden topical structure that will be inferred using the observed attribute-values. Intuitively, the distribution of
2. http://www.lemurproject.org
3. http://lemurproject.org/clueweb09
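The enrichment step in the LDA paragraph above, where each attribute-value is treated as a word and augmented with its Wikipedia categories, might look roughly like the sketch below. The `CATEGORIES` dictionary is a hypothetical stand-in for the search-index lookup and the category toolkit, and the value "jazz" is invented. The sketch only builds the enriched bag of words; fitting an LDA model on these documents is not shown.

```python
from typing import Dict, List

# Hypothetical lookup standing in for the article retrieval and category
# extraction steps; only the "someone like you" entry comes from the text.
CATEGORIES: Dict[str, List[str]] = {
    "someone like you": [
        "Adele (singer) songs",
        "Singles certified septuple platinum by the Australian Recording Industry Association",
    ],
}


def enrich_profile(values: List[str], categories: Dict[str, List[str]]) -> List[str]:
    """Treat each attribute-value as a word and append its categories as extra words."""
    words: List[str] = []
    for value in values:
        words.append(value)
        # Values without a known article contribute no extra categories.
        words.extend(categories.get(value.lower(), []))
    return words


document = enrich_profile(["Someone Like You", "jazz"], CATEGORIES)
print(document)
```

Each enriched list would then serve as one document for a topic model, with categories shared across values acting as the bridge between otherwise unrelated interests.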
• Make inferences using the inference engine
• Use user profiles to construct an inference engine containing a set of inference rules
What can be inferred from the population?
[Chart: inference gain, axis ticks from 0 to 15] [Moore et al. 2013]
LinkedIn dataset: 91,150 public profiles, 12 attributes per profile
Web Footprinting
Experiments for Understanding Public Profiles
About.me: a personal website hosting site. Each user can make a custom webpage about themselves and can list links to their social media profiles on multiple websites.
Using their API, we collected information on 124,497 people -> ground truth
Creating Web Footprints Using Google+, Foursquare, and LinkedIn Profiles
[Singh et al. 2015]
Synonyms can be found
[Diagram labels: DBpedia, Synonyms, Meronym]
Using an Ontology
Approximately 8,000 attributes were matched up from the ontology
Taking Control of Our Web Identity and Data
1. Keep your public profile professional
2. Change all your social media account settings that have personal information on them from public to private
3. Choose your friends wisely – add them selectively
4. Join groups related to your professional interests
5. Make it difficult for automated tools to link your accounts, e.g., use different account user names, share different information, etc.
6. Install ad blockers to reduce data about your click-through habits
7. Set your browser to not accept cookies from sites that you have not visited before
The world around us
DATAFICATION
Data Ethics
• Regulation – We need to hold companies to higher standards
• Data ethics standards – We need discussion, debate, and possibly a new discipline
• Catalog of personal data – Individuals should be able to see, correct, and/or remove data companies have about them
[Singh 2016]
Final Thoughts
• There is a cultural acceptance of sharing private data publicly
• This is a problem: I have shown you different techniques for generating web footprints. It is too easy
• We need new ways to help users understand what data can be determined about them and help them take control of their information
• We need to pause and debate online privacy and ethical uses of large-scale human behavioral data
• We need to develop guidelines and regulations that protect users
We need to take back control of our data
References
J. Zhu, S. Zhang, L. Singh, H. Yang, and M. Sherr. Generating Risk Reduction Recommendations to Decrease Identifiability of Public Online Profiles. Under submission.
A. Hian-Cheong, L. Singh, M. Sherr, H. Yang. Semantics and Public Information Exposure Detection. Invited.
L. Singh, H. Yang, M. Sherr, A. Hian-Cheong, K. Tian, J. Zhu, and S. Zhang. Public Information Exposure Detection: Helping Users Understand Their Web Footprints. International Conference on Advances in Social Networks Analysis and Mining (ASONAM), Paris, France. IEEE/ACM, 2015.
L. Singh, H. Yang, M. Sherr, Y. Wei, A. Hian-Cheong, K. Tian, J. Zhu, S. Zhang, T. Vaidya, and E. Asgarli. Helping Users Understand Their Web Footprints. International Conference on World Wide Web - Companion Proceedings, World Wide Web (WWW), Florence, Italy. Poster Paper, 2015.
W. B. Moore, Y. Wei, A. Orshefsky, M. Sherr, L. Singh, H. Yang. Understanding Site-Based Inference Potential for Identifying Hidden Attributes. International Conference on Privacy, Security, Risk and Trust, Alexandria, VA. IEEE Computer Society, 2013.
J. Ferro, L. Singh, M. Sherr. Identifying individual vulnerability based on public data. International Conference on Privacy, Security and Trust, Tarragona, Catalonia, Spain. IEEE Computer Society, 2013.
F. Nagle, L. Singh, and A. Gkoulalas-Divanis. EWNI: Efficient Anonymization of Vulnerable Individuals in Social Networks. Pacific Asian Conference on Knowledge Discovery and Data Mining (PAKDD), Kuala Lumpur, Malaysia. Springer, 2012.
A. Ramachandran, L. Singh, E. Porter, and F. Nagle. Exploring re-identification risks in public domains. Conference on Privacy, Security and Trust (PST). IEEE Computer Society, 2012.
The Team & Support
• Faculty – Lisa Singh, Micah Sherr, Grace Hui Yang
• Students & Researchers – Rob Churchill, Kristen Skillman, Kevin Tian, Sicong Zhang, Yanan Zhu
• Alumni – Aditi Ramachandran, Frank Nagle, John Ferro, Yifang Wei, Brad Moore, Andrew Hian-Cheong, Janet Zhu
Support: National Science Foundation
5 Reasons to Join Our Program
1. We are research active and provide full financial RA support for PhD students for 5 years
2. We have full and partial scholarships for Master's students
3. Our courses span applied and theoretical areas of computer science as well as interdisciplinary areas like data science
4. We have exceptional job placement in top tech firms, national labs, and government agencies
5. We have a strong community among students and faculty
Need more information?
Website: http://cs.georgetown.edu
Graduate Director: Lisa Singh (singh@cs.georgetown.edu)
Application deadline: April 1
What about tweets?
• A special wish for a special girl HappyBirthday
• I love Starbuck MangoTeaLemonade
• Go Bears
[Singh et al. 2015]
• Birthday • Gender • Address • Education • Hobbies
• Skills • Title • Industry • Education • Experience
• Thoughts • Ideas • Interests • Hobbies
To what degree can site-level data be leveraged to determine the undisclosed attributes of a user?
What about the population?
Methodology
• Sample user profiles from media sites
bull Makeinferencesusingtheinferenceenginebull Useuserprofilestoconstructaninferenceenginecontainingasetofinferencerules
0
3
6
9
12
15Inferencegain
Inferencegain
What can be inferred from the population
[Mooreetal2013]
LinkedIndataset91150publicprofiles12attributesperprofile
Web Footprinting
Experiments for Understanding Public Profiles
Aboutme - personal website hosting site Each user can make a custom
webpage about themselves Can list links to their social
media profiles on multiple websites
Using their API we collected 124497 peoples information -gt Ground Truth
21
Creating Web Footprints Using Google+ Foursquare LinkedIn Profiles
[Singhetal2015]
23
Synonyms can be found
Dbpedia
Synonyms
Meronym
24
Using an Ontology
25
Approximately8000attributeswerematchedupfromtheontology
Taking Control of Our Web Identity and Data
1 Keep your public profile professional2 Change all your social media account settings that have
personal information on them from public to private3 Choose your friends wisely ndash add them selectively4 Join groups related to your professional interests5 Make it difficult for automated tools to link your accounts
eg use different account user names share different information etc
6 Install ad blockers to reduce data about your click through habits
7 Set your browser to not accept cookies from sites that you have not visited before
The world around us
DATAFICATION
Data Ethics
bull Regulationndash We need to hold companies to higher standards
bull Data ethics standardsndash We need discussion debate and possibly a new discipline
bull Catalog of personal datandash Individuals should be able to see correct andor remove
data companies have about them
[Singh2016]
Final Thoughts
bull There is a cultural acceptance of sharing private data publicly bull This is a problem - I have shown you different techniques for
generating web footprints It is too easybull We need new ways to help users understand what data can be
determined about them and help them take control of their information
bull We need to pause and debate online privacy and ethical uses of large-scale human behavioral data
bull We need to develop guidelines and regulations that protect users
We need to take back control of our data
ReferencesJ Zhu S Zhang L Singh H Yang and M Sherr Generating Risk Reduction Recommendations to Decrease Identifiability of Public Online Profiles under submission
A Hian-Cheong L Singh M Sherr H Yang Semantics and Public Information Exposure Detection invited
L Singh H Yang M Sherr A Hian-Cheong K Tian J Zhu and S Zhang Public Information Exposure Detection Helping Users Understand Their Web Footprints International Conference on Advances in Social Networks Analysis and Mining (ASONAM) Paris France EEEACM 2015
L Singh H Yang M Sherr Y Wei A Hian-Cheong K Tian J Zhu S Zhang T Vaidya and E Asgarli Helping Users Understand Their Web Footprints International Conference on World Wide Web - Companion Proceedings World Wide Web (WWW) Florence Italy Poster Paper 2015
W B Moore Y Wei A Orshefsky M Sherr L Singh H Yang Understanding Site-Based Inference Potential for Identifying Hidden Attributes International Conference on Privacy Security Risk and Trust Alexandria VA IEEE Computer Society 2013
J Ferro L Singh M Sherr Identifying individual vulnerability based on public data International Conference on Privacy Security and Trust Tarragona Catalonia Spain IEEE Computer Society 2013
F Nagle L Singh and A Gkoulalas-Divanis EWNI Efficient Anonymization of Vulnerable Individuals in Social Networks Pacific Asian Conference on Knowledge Discovery and Data Mining (PAKDD) Kuala Lumpur Malaysia Springer 2012
A Ramachandran L Singh E Porter and F Nagle Exploring re-identification risks in public domains Conference on Privacy Security and Trust (PST) IEEE Computer Society 2012
The Team & Support
• Faculty: Lisa Singh, Micah Sherr, Grace Hui Yang
• Students & Researchers: Rob Churchill, Kristen Skillman, Kevin Tian, Sicong Zhang, Yanan Zhu
• Alumni: Aditi Ramachandran, Frank Nagle, John Ferro, Yifang Wei, Brad Moore, Andrew Hian-Cheong, Janet Zhu
Support: National Science Foundation
5 Reasons to Join Our Program
1. We are research active and provide full financial RA support for PhD students for 5 years.
2. We have full and partial scholarships for Master's students.
3. Our courses span applied and theoretical areas of computer science as well as interdisciplinary areas like data science.
4. We have exceptional job placement in top tech firms, national labs, and government agencies.
5. We have a strong community among students and faculty.
Need more information?
Website: http://cs.georgetown.edu
Graduate Director: Lisa Singh (singh@cs.georgetown.edu)
Application deadline: April 1
• Birthday • Gender • Address • Education • Hobbies
• Skills • Title • Industry • Education • Experience
• Thoughts • Ideas • Interests • Hobbies
To what degree can site-level data be leveraged to determine the undisclosed attributes of a user?
What about the population?
Methodology
• Sample user profiles from media sites
[Figure 1 panels: Step 1, Subpopulation Sampling (public profiles); Step 2, Inference Engine Construction (inference rules such as ab → c, acd → e, ad → b, bcdf → a, together with an inference model); Step 3, Determination of Hidden Attribute-Values (a user profile is passed to the inference engine, which outputs hidden attribute-values)]
Fig. 1. Attribute inference methodology. The adversary samples a random subpopulation of site-level public profiles (left), constructs site-level inference rules and models using the sampled public profiles (center), and applies the inference engine to a targeted user's public profile to predict a hidden attribute-value (right).
the value of attribute $A_j$ when restricted$(i, A_j) =$ [...]). The adversary's goal is to discover information about $i$ which is not in the public profile $P^i$.
To more formally define the attacker, we introduce a truth function truth$(i)$ that returns a set $\{id, v^i_1, v^i_2, \ldots, v^i_m\}$ such that (1) $id = P^i_{ID}$ and (2) either $v^i_j = P^i_j$ if $P^i_j$ is non-empty, or otherwise $v^i_j$ is the correct value of attribute $A_j$ for user $i$. Intuitively, truth$(i)$ is the complete set of correct values for user $i$ for attributes $ID, A_1, \ldots, A_m$, and can contain values that are not in $D$. Hence, the adversary's goal is to infer the set truth$(i) \setminus P^i$. Note that this includes both attributes that are restricted using the site's permission system as well as attributes that are unknown (i.e., null) to the site. Toward that end, in this paper we attempt to infer single attribute-values using data available to the adversary.
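The adversary's target set can be illustrated with a minimal sketch, assuming profiles are modelled as Python dictionaries and that a value of None marks a restricted or unknown attribute. The function name `hidden_targets` and the sample values are hypothetical and are not taken from the paper.

```python
def hidden_targets(truth, public):
    """Return the attribute-values present in `truth` but missing from `public`.

    Both arguments map attribute names to values. A value of None in
    `public` stands for a restricted or unknown attribute (an assumption
    made for this sketch only).
    """
    return {
        attr: value
        for attr, value in truth.items()
        if public.get(attr) is None
    }


# Hypothetical ground truth and public profile for a single user.
truth_i = {"id": 6, "gender": "F", "relationship": "single", "industry": "software"}
public_i = {"id": 6, "gender": "F", "relationship": "single", "industry": None}

targets = hidden_targets(truth_i, public_i)
print(targets)
```

In this sketch the only entry left for the adversary to infer is the industry value, which mirrors the single attribute-value setting described above.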
IV. ATTRIBUTE INFERENCE METHODOLOGY
Even though social networking sites often include privacy settings that allow a user to control which attributes in her profile are disclosed to the public, based on the previous literature presented in Section II, we make the observation that removing sensitive attributes from a public profile is insufficient to ensure that those attributes are not easily discoverable. In this paper, we are interested in understanding how a public attribute or public attribute combination can be used to infer hidden values. Therefore, we analyze how effective frequent patterns of a site's subpopulation are for inferring sensitive attributes that are hidden by a particular user.
We develop an attribute inference methodology for determining non-published attributes about a targeted user. Our methodology is based on the premise that an attacker may explore the site in question and then use this background knowledge to make inferences about a particular user's non-published attributes. Figure 1 shows the three steps in our high-level attribute inference methodology: subpopulation sampling, inference engine construction, and determination of hidden attribute-values.
A. Subpopulation Sampling
The first step of our methodology is to randomly sample a subpopulation of profiles from a site containing a database $D$. More formally, our subpopulation $D'$ has a representative sample of the attribute-value pairs of interest from $D$ (i.e., $D' \subseteq D$). In practice, an adversary can trivially obtain a subpopulation sample by using a site's web interface or API.
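The sampling step might look like the following sketch. It assumes the public profiles have already been collected into a list and uses uniform sampling without replacement, which is one simple choice and not necessarily the sampling scheme used in the study. The names `sample_subpopulation` and `site_profiles` are hypothetical.

```python
import random


def sample_subpopulation(profiles, k, seed=None):
    """Draw up to k profiles uniformly at random, without replacement."""
    rng = random.Random(seed)      # a local generator keeps the draw reproducible
    k = min(k, len(profiles))      # never ask for more profiles than exist
    return rng.sample(profiles, k)


# Hypothetical stand-in for the public profiles exposed by a site.
site_profiles = [{"id": n, "gender": "F" if n % 2 else "M"} for n in range(1, 101)]

subpop = sample_subpopulation(site_profiles, k=10, seed=0)
print(len(subpop))
```

Passing a seed makes the draw repeatable, which is convenient when comparing inference methods on the same sample.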
B. Site-based Inference Engine Construction and Determination of Hidden Attribute-Values
There are many methods for building a site-based inference engine. We begin by extending two previously proposed approaches: one that uses Latent Dirichlet Allocation (LDA) [4] and another based on a modified Naïve Bayes method [14]. We then propose a new approach based on multi-attribute association rule mining. Finally, we consider an ensemble approach that incorporates all of the different techniques into the site-level inference engine. Construction of the inference engine is done offline and infrequently for a particular site; therefore, the cost of generating inference models or rules is not significantly burdensome to the adversary.
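To make the association rule idea concrete, here is a minimal sketch of support and confidence based rule mining over attribute-value pairs. It is a generic textbook-style formulation under stated assumptions, not the authors' multi-attribute algorithm. The thresholds, the function names `mine_rules` and `infer`, and the sample profiles are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def mine_rules(profiles, target, min_support=2, min_confidence=0.6, max_len=2):
    """Mine rules of the form {attr=value, ...} -> target=value.

    Only profiles that disclose the target attribute are counted, so
    confidence is measured among those profiles.
    """
    antecedent_counts = Counter()
    rule_counts = Counter()
    for profile in profiles:
        if profile.get(target) is None:
            continue
        items = sorted((a, v) for a, v in profile.items()
                       if a != target and v is not None)
        for size in range(1, max_len + 1):
            for antecedent in combinations(items, size):
                antecedent_counts[antecedent] += 1
                rule_counts[(antecedent, profile[target])] += 1
    rules = []
    for (antecedent, value), count in rule_counts.items():
        confidence = count / antecedent_counts[antecedent]
        if count >= min_support and confidence >= min_confidence:
            rules.append((antecedent, value, count, confidence))
    return rules


def infer(rules, public_profile):
    """Apply the most confident (then best supported) matching rule, else None."""
    known = {(a, v) for a, v in public_profile.items() if v is not None}
    matching = [r for r in rules if set(r[0]) <= known]
    if not matching:
        return None
    return max(matching, key=lambda r: (r[3], r[2]))[1]


# Hypothetical sampled public profiles.
sampled = [
    {"gender": "F", "relationship": "married", "industry": "education"},
    {"gender": "F", "relationship": "married", "industry": "education"},
    {"gender": "M", "relationship": "single", "industry": "software"},
    {"gender": "M", "relationship": "single", "industry": "software"},
    {"gender": "M", "relationship": "married", "industry": "finance"},
]
rules = mine_rules(sampled, target="industry")
guess = infer(rules, {"gender": "M", "relationship": "single", "industry": None})
print(guess)
```

Because the rules are mined once and reused for every targeted profile, this sketch also reflects the point that engine construction can be done offline.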
To clarify the different methods, we will use a toy example based on user data presented in Figure 2. In this example, $D$ contains four attributes: id, gender, relationship, and industry. The adversary is interested in determining User 6's industry attribute-value. In this scenario, User 6 has decided to not make this attribute-value public. Using the site API, the adversary obtains $D'$, a subset of $D$ containing the public profiles for Users 1-5. The adversary will now generate his inference engine using these public profiles. The remainder of this subsection describes each of the methods that can serve as the basis for the inference engine that the adversary will build.
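The toy scenario can be walked through with a small sketch. Figure 2 is not reproduced here, so the attribute values for Users 1-6 below are invented for illustration. The classifier is a standard Naive Bayes with Laplace smoothing, swapped in as a stand-in; it is not the modified Naive Bayes method cited above. The function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(profiles, target, alpha=1.0):
    """Count class and per-class attribute-value frequencies."""
    class_counts = Counter()
    value_counts = defaultdict(Counter)   # (class, attr) -> counts of values
    domains = defaultdict(set)            # attr -> observed values
    for profile in profiles:
        label = profile[target]
        class_counts[label] += 1
        for attr, value in profile.items():
            if attr in (target, "id"):
                continue
            value_counts[(label, attr)][value] += 1
            domains[attr].add(value)
    return class_counts, value_counts, domains, alpha


def predict(model, public_profile, target):
    """Return the class with the highest smoothed log-posterior score."""
    class_counts, value_counts, domains, alpha = model
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        score = math.log(n / total)
        for attr, value in public_profile.items():
            if attr in (target, "id") or value is None or attr not in domains:
                continue
            count = value_counts[(label, attr)][value]
            score += math.log((count + alpha) / (n + alpha * len(domains[attr])))
        scores[label] = score
    return max(scores, key=scores.get)


# Invented values for the public profiles of Users 1-5.
users_1_to_5 = [
    {"id": 1, "gender": "F", "relationship": "married", "industry": "education"},
    {"id": 2, "gender": "F", "relationship": "married", "industry": "education"},
    {"id": 3, "gender": "M", "relationship": "single", "industry": "software"},
    {"id": 4, "gender": "M", "relationship": "single", "industry": "software"},
    {"id": 5, "gender": "M", "relationship": "married", "industry": "finance"},
]
# User 6 hides the industry attribute-value.
user_6 = {"id": 6, "gender": "M", "relationship": "single", "industry": None}

model = train_naive_bayes(users_1_to_5, target="industry")
prediction = predict(model, user_6, target="industry")
print(prediction)
```

With these invented values, the gender and relationship fields that User 6 does publish are enough for the model to favor one industry, which is the kind of leakage the methodology is designed to measure.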
LDA Nearest Neighbor Inference: Chaabane et al. use the Latent Dirichlet Allocation (LDA) generative model to extract semantic links between interest attribute-values [3]. Our variation of their method is as follows.
Each profile $\{id_k, v^k_1, v^k_2, \ldots, v^k_q\}$ in the subpopulation $D'$ consists of the attribute-value pairs for some subset of attributes in $D$. We begin by considering a particular attribute $A_q$. Each attribute has a domain containing a set of values, $|A_q| = \{v_1, \ldots, v_m\}$. In LDA terms, we consider each attribute-value a word. For each attribute-value $v_k$, we obtain its related Wikipedia categories to enhance the value sets. We first retrieve the top relevant article describing the attribute $v_k$ from a free text index built by the Lemur Search Engine (footnote 2) over the entire Wikipedia stub contained in the ClueWeb09 collection (footnote 3). The index's size is approximately 1GB for the compressed documents. Next, from each of these articles, we use Wikipedia as an ontology and obtain all the categories and general categories for the top $n$ articles using a toolkit developed by [10]. For instance, a value "someone like you" has top Wikipedia categories "Adele (singer) songs" and "Singles certified septuple platinum by the Australian Recording Industry Association". These categories help to create the hidden topical structure that will be inferred using the observed attribute-values. Intuitively the distribution of
2 http://www.lemurproject.org
3 http://lemurproject.org/clueweb09
• Use user profiles to construct an inference engine containing a set of inference rules
• Make inferences using the inference engine
What can be inferred from the population?
[Chart: inference gain, axis from 0 to 15]
[Moore et al. 2013]
LinkedIn dataset: 91,150 public profiles, 12 attributes per profile
Web Footprinting
Experiments for Understanding Public Profiles
About.me: a personal website hosting site. Each user can make a custom webpage about themselves and can list links to their social media profiles on multiple websites.
Using their API, we collected 124,497 people's information → ground truth
Creating Web Footprints Using Google+, Foursquare, and LinkedIn Profiles
[Singh et al. 2015]
Synonyms can be found
DBpedia
Synonyms
Meronym
Using an Ontology
Approximately 8,000 attributes were matched up from the ontology
Taking Control of Our Web Identity and Data
1. Keep your public profile professional.
2. Change all your social media account settings that have personal information on them from public to private.
3. Choose your friends wisely: add them selectively.
4. Join groups related to your professional interests.
5. Make it difficult for automated tools to link your accounts, e.g., use different account user names, share different information, etc.
6. Install ad blockers to reduce data about your click-through habits.
7. Set your browser to not accept cookies from sites that you have not visited before.
The world around us
DATAFICATION
Data Ethics
• Regulation: We need to hold companies to higher standards.
• Data ethics standards: We need discussion, debate, and possibly a new discipline.
• Catalog of personal data: Individuals should be able to see, correct, and/or remove data companies have about them.
[Singh 2016]
Methodology
bull Sample user profiles from media sites
ab rarr cacd rarr ead rarr b
bcdf rarr a
Public Profiles
Step 1Subpopulation
Sampling
Step 2Inference Engine
Construction
UserProfile
InferenceModel
HiddenAttribute-Values
Step 3Determination of Hidden
Attribute-Values
InferenceEngine
InferenceModel
Fig 1 Attribute inference methodology The adversary samples arandom subpopulation of site-level public profiles (left) constructssite-level inference rules and models using the sampled public profiles(center) and applies the inference engine to a targeted userrsquos publicprofile to predict a hidden attribute-value (right)
the value of attribute Aj when restricted(i Aj) = ) Theadversaryrsquos goal is to discover information about i which isnot in the public profile P i
To more formally define the attacker we introduce a truthfunction truth(i) that returns a set id vi1 vi2 vim suchthat (1) id = Pi
ID and (2) either vij = Pij if Pi
j = orotherwise vij is the correct value of attribute Aj for user iIntuitively truth(i) is the complete set of correct values foruser i for attributes ID A1 Am and can contain valuesthat are not in D Hence the adversaryrsquos goal is to infer theset truth(i) P i Note that this includes both attributes thatare restricted using the sitersquos permission system as well asattributes that are unknown (ie null) to the site Toward thatend in this paper we attempt to infer single attribute-valuesusing data available to the adversary
IV ATTRIBUTE INFERENCE METHODOLOGY
Even though social networking sites often include privacysettings that allow a user to control which attributes in her pro-file are disclosed to the public based on the previous literaturepresented in Section II we make the observation that removingsensitive attributes from a public profile is insufficient to ensurethat those attributes are not easily discoverable In this paperwe are interested in understanding how a public attribute orpublic attribute combination can be used to infer hidden valuesTherefore we analyze how effective frequent patterns of asitersquos subpopulation are for inferring sensitive attributes thatare hidden by a particular user
We develop an attribute inference methodology for deter-mining non-published attributes about a targeted user Ourmethodology is based on the premise that an attacker mayexplore the site in question and then use this backgroundknowledge to make inferences about a particular userrsquos non-published attributes Figure 1 shows the three steps in our highlevel attribute inference methodology subpopulation samplinginference engine construction and determination of hiddenattribute-values
A Subpopulation Sampling
The first step of our methodology is to randomly samplea subpopulation of profiles from a site containing a databaseD More formally our subpopulation D0 has a representativesample of the attribute-value pairs of interest from D (ieD0 D) In practice an adversary can trivially obtain asubpopulation sample by using a sitersquos web interface or API
B Site-based Inference Engine Construction and Determina-tion of Hidden Attribute-Values
There are many methods for building a site-based inferenceengine We begin by extending two previously proposed ap-proaches one that uses Latent Dirichlet Allocation (LDA) [4]and another based on a modified Naıve Bayes method [14]We then propose a new approach based on multi-attributeassociation rule mining Finally we consider an ensembleapproach that incorporates all of the different techniques intothe site-level inference engine Construction of the inferenceengine is done offline and infrequently for a particular sitetherefore the cost of generating inference models or rules isnot significantly burdensome to the adversary
To clarify the different methods we will use a toy examplebased on user data presented in Figure 2 In this example Dcontains four attributes id gender relationship and industryThe adversary is interested in determining User 6rsquos industryattribute-value In this scenario User 6 has decided to not makethis attribute-value public Using the site API the adversaryobtains D0 a subset of D containing the public profiles forUsers 1-5 The adversary will now generate his inferenceengine using these public profiles The remainder of thissubsection describes each of the methods that can serve as thebasis for the inference engine that the adversary will build
LDA Nearest Neighbor Inference Chaabane et al usethe Latent Dirichlet Allocation (LDA) generative model toextract semantic links between interest attribute-values [3] Ourvariation of their method is as follows
Each profile idk vk1 vk2 vkq in the subpopulation D0
consists of the attribute-value pairs for some subset of at-tributes in D We begin by considering a particular attributeAq Each attribute has a domain containing a set of values|Aq| = v1 vm In LDA terms we consider eachattribute-value a word For each attribute-value vk we obtainits related Wikipedia categories to enhance the value sets Wefirst retrieve the top relevant article describing the attributevk from a free text index built by the Lemur Search Engine2
over the entire Wikipedia stub contained in the ClueWeb09collection3 The indexrsquos size is approximately 1GB for thecompressed documents Next from each of these articleswe use Wikipedia as an ontology and obtain all the cate-gories and general categories for the top n articles using atoolkit developed by [10] For instance a value ldquosomeonelike yourdquo has top Wikipedia categories ldquoAdele (singer) songsrdquoand ldquoSingles certified septuple platinum by the AustralianRecording Industry Associationrdquo These categories help tocreate the hidden topical structure that will be inferred usingthe observed attribute-values Intuitively the distribution of
2httpwwwlemurprojectorg3httplemurprojectorgclueweb09
ab rarr cacd rarr ead rarr b
bcdf rarr a
Public Profiles
Step 1Subpopulation
Sampling
Step 2Inference Engine
Construction
UserProfile
InferenceModel
HiddenAttribute-Values
Step 3Determination of Hidden
Attribute-Values
InferenceEngine
InferenceModel
Fig 1 Attribute inference methodology The adversary samples arandom subpopulation of site-level public profiles (left) constructssite-level inference rules and models using the sampled public profiles(center) and applies the inference engine to a targeted userrsquos publicprofile to predict a hidden attribute-value (right)
the value of attribute Aj when restricted(i Aj) = ) Theadversaryrsquos goal is to discover information about i which isnot in the public profile P i
To more formally define the attacker we introduce a truthfunction truth(i) that returns a set id vi1 vi2 vim suchthat (1) id = Pi
ID and (2) either vij = Pij if Pi
j = orotherwise vij is the correct value of attribute Aj for user iIntuitively truth(i) is the complete set of correct values foruser i for attributes ID A1 Am and can contain valuesthat are not in D Hence the adversaryrsquos goal is to infer theset truth(i) P i Note that this includes both attributes thatare restricted using the sitersquos permission system as well asattributes that are unknown (ie null) to the site Toward thatend in this paper we attempt to infer single attribute-valuesusing data available to the adversary
IV ATTRIBUTE INFERENCE METHODOLOGY
Even though social networking sites often include privacysettings that allow a user to control which attributes in her pro-file are disclosed to the public based on the previous literaturepresented in Section II we make the observation that removingsensitive attributes from a public profile is insufficient to ensurethat those attributes are not easily discoverable In this paperwe are interested in understanding how a public attribute orpublic attribute combination can be used to infer hidden valuesTherefore we analyze how effective frequent patterns of asitersquos subpopulation are for inferring sensitive attributes thatare hidden by a particular user
We develop an attribute inference methodology for deter-mining non-published attributes about a targeted user Ourmethodology is based on the premise that an attacker mayexplore the site in question and then use this backgroundknowledge to make inferences about a particular userrsquos non-published attributes Figure 1 shows the three steps in our highlevel attribute inference methodology subpopulation samplinginference engine construction and determination of hiddenattribute-values
A Subpopulation Sampling
The first step of our methodology is to randomly samplea subpopulation of profiles from a site containing a databaseD More formally our subpopulation D0 has a representativesample of the attribute-value pairs of interest from D (ieD0 D) In practice an adversary can trivially obtain asubpopulation sample by using a sitersquos web interface or API
B. Site-based Inference Engine Construction and Determination of Hidden Attribute-Values
There are many methods for building a site-based inference engine. We begin by extending two previously proposed approaches: one that uses Latent Dirichlet Allocation (LDA) [4] and another based on a modified Naïve Bayes method [14]. We then propose a new approach based on multi-attribute association rule mining. Finally, we consider an ensemble approach that incorporates all of the different techniques into the site-level inference engine. Construction of the inference engine is done offline and infrequently for a particular site; therefore, the cost of generating inference models or rules is not significantly burdensome to the adversary.
To clarify the different methods, we will use a toy example based on user data presented in Figure 2. In this example, $D$ contains four attributes: id, gender, relationship, and industry. The adversary is interested in determining User 6's industry attribute-value. In this scenario, User 6 has decided to not make this attribute-value public. Using the site API, the adversary obtains $D'$, a subset of $D$ containing the public profiles for Users 1-5. The adversary will now generate his inference engine using these public profiles. The remainder of this subsection describes each of the methods that can serve as the basis for the inference engine that the adversary will build.
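To make the toy scenario concrete, the following is a rough sketch of rule-based inference over five public profiles. It is not the authors' algorithm: it enumerates every attribute-value combination exhaustively, the support and confidence thresholds are arbitrary, and the profile values are invented for illustration (they are not the contents of Figure 2).

```python
from collections import Counter
from itertools import combinations


def mine_rules(profiles, target, min_support=2, min_confidence=0.6):
    """Mine rules {attr=value, ...} -> target=value from profiles that publish the target."""
    antecedent_counts = Counter()
    joint_counts = Counter()
    for p in profiles:
        if target not in p:
            continue
        items = sorted((a, v) for a, v in p.items() if a not in ("id", target))
        for r in range(1, len(items) + 1):
            for antecedent in combinations(items, r):
                antecedent_counts[antecedent] += 1
                joint_counts[(antecedent, p[target])] += 1
    rules = []
    for (antecedent, value), support in joint_counts.items():
        confidence = support / antecedent_counts[antecedent]
        if support >= min_support and confidence >= min_confidence:
            rules.append((antecedent, value, support, confidence))
    return rules


def infer(rules, profile):
    """Predict the hidden value with the best matching rule (confidence, then support)."""
    known = set(profile.items())
    matching = [r for r in rules if set(r[0]) <= known]
    if not matching:
        return None
    return max(matching, key=lambda r: (r[3], r[2], len(r[0])))[1]


# Hypothetical public profiles for Users 1-5.
D_prime = [
    {"id": 1, "gender": "F", "relationship": "single", "industry": "Education"},
    {"id": 2, "gender": "F", "relationship": "single", "industry": "Education"},
    {"id": 3, "gender": "M", "relationship": "married", "industry": "Finance"},
    {"id": 4, "gender": "F", "relationship": "married", "industry": "Finance"},
    {"id": 5, "gender": "M", "relationship": "single", "industry": "Education"},
]
user6 = {"id": 6, "gender": "F", "relationship": "single"}  # industry is hidden

rules = mine_rules(D_prime, target="industry")
prediction = infer(rules, user6)
```

With this invented data, every sampled profile with relationship "single" lists the same industry, so that rule has the highest confidence and decides the prediction for User 6.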
LDA Nearest Neighbor Inference: Chaabane et al. use the Latent Dirichlet Allocation (LDA) generative model to extract semantic links between interest attribute-values [3]. Our variation of their method is as follows.
Each profile $\{id_k, v_{k1}, v_{k2}, \ldots, v_{kq}\}$ in the subpopulation $D'$ consists of the attribute-value pairs for some subset of attributes in $D$. We begin by considering a particular attribute $A_q$. Each attribute has a domain containing a set of values $|A_q| = \{v_1, \ldots, v_m\}$. In LDA terms, we consider each attribute-value a word. For each attribute-value $v_k$, we obtain its related Wikipedia categories to enhance the value sets. We first retrieve the top relevant article describing the attribute $v_k$ from a free text index built by the Lemur Search Engine (footnote 2) over the entire Wikipedia stub contained in the ClueWeb09 collection (footnote 3). The index's size is approximately 1GB for the compressed documents. Next, from each of these articles, we use Wikipedia as an ontology and obtain all the categories and general categories for the top n articles using a toolkit developed by [10]. For instance, a value "someone like you" has top Wikipedia categories "Adele (singer) songs" and "Singles certified septuple platinum by the Australian Recording Industry Association". These categories help to create the hidden topical structure that will be inferred using the observed attribute-values. Intuitively, the distribution of
Footnote 2: http://www.lemurproject.org
Footnote 3: http://lemurproject.org/clueweb09
[Figure 1 (diagram). Step 1: Subpopulation Sampling (public profiles). Step 2: Inference Engine Construction (inference models and rules such as ab → c, acd → e, ad → b, bcdf → a). Step 3: Determination of Hidden Attribute-Values (user profile → inference engine → hidden attribute-values).]
Fig. 1. Attribute inference methodology. The adversary samples a random subpopulation of site-level public profiles (left), constructs site-level inference rules and models using the sampled public profiles (center), and applies the inference engine to a targeted user's public profile to predict a hidden attribute-value (right).
• Make inferences using the inference engine
• Use user profiles to construct an inference engine containing a set of inference rules
[Figure: chart of inference gain; axis ticks 0, 3, 6, 9, 12, 15]
What can be inferred from the population
[Moore et al. 2013]
LinkedIn dataset: 91,150 public profiles, 12 attributes per profile
Web Footprinting
Experiments for Understanding Public Profiles
About.me: a personal website hosting site. Each user can make a custom webpage about themselves and can list links to their social media profiles on multiple websites.
Using their API, we collected 124,497 people's information -> ground truth
Creating Web Footprints Using Google+, Foursquare, and LinkedIn Profiles
[Singh et al. 2015]
Synonyms can be found
[Diagram labels: DBpedia, Synonyms, Meronym]
Using an Ontology
Approximately 8,000 attributes were matched up from the ontology
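One way to picture how synonym matching helps build a footprint is sketched below. It assumes a tiny hand-written synonym table in place of a real ontology such as DBpedia, and the names `SYNONYMS`, `canonical`, and `merge_footprint`, along with the sample profiles, are hypothetical.

```python
# Hypothetical synonym table; a real system would derive these links from an ontology.
SYNONYMS = {
    "nyc": "new york city",
    "new york": "new york city",
    "software developer": "software engineer",
}


def canonical(value):
    """Lower-case, collapse whitespace, and map known synonyms to a single form."""
    normalized = " ".join(value.lower().split())
    return SYNONYMS.get(normalized, normalized)


def merge_footprint(profiles_by_site):
    """Combine per-site profiles into {attribute: {canonical value: sites that expose it}}."""
    footprint = {}
    for site, profile in profiles_by_site.items():
        for attr, value in profile.items():
            footprint.setdefault(attr, {}).setdefault(canonical(value), set()).add(site)
    return footprint


profiles = {
    "Google+": {"location": "NYC", "occupation": "Software Developer"},
    "LinkedIn": {"location": "New York City", "occupation": "software engineer"},
    "Foursquare": {"location": "New York"},
}

footprint = merge_footprint(profiles)
```

Values that look different on each site collapse to one entry once synonyms are resolved, which is what lets attributes from separate profiles be matched up.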
Taking Control of Our Web Identity and Data
1. Keep your public profile professional
2. Change all your social media account settings that have personal information on them from public to private
3. Choose your friends wisely – add them selectively
4. Join groups related to your professional interests
5. Make it difficult for automated tools to link your accounts, e.g., use different account user names, share different information, etc.
6. Install ad blockers to reduce data about your click-through habits
7. Set your browser to not accept cookies from sites that you have not visited before
The world around us
DATAFICATION
Data Ethics
• Regulation
  – We need to hold companies to higher standards
• Data ethics standards
  – We need discussion, debate, and possibly a new discipline
• Catalog of personal data
  – Individuals should be able to see, correct, and/or remove data companies have about them
[Singh 2016]
Final Thoughts
• There is a cultural acceptance of sharing private data publicly
• This is a problem. I have shown you different techniques for generating web footprints. It is too easy.
• We need new ways to help users understand what data can be determined about them and help them take control of their information
• We need to pause and debate online privacy and ethical uses of large-scale human behavioral data
• We need to develop guidelines and regulations that protect users
We need to take back control of our data
References
J. Zhu, S. Zhang, L. Singh, H. Yang, and M. Sherr. Generating Risk Reduction Recommendations to Decrease Identifiability of Public Online Profiles. Under submission.
A. Hian-Cheong, L. Singh, M. Sherr, and H. Yang. Semantics and Public Information Exposure Detection. Invited.
L. Singh, H. Yang, M. Sherr, A. Hian-Cheong, K. Tian, J. Zhu, and S. Zhang. Public Information Exposure Detection: Helping Users Understand Their Web Footprints. International Conference on Advances in Social Networks Analysis and Mining (ASONAM), Paris, France. IEEE/ACM, 2015.
L. Singh, H. Yang, M. Sherr, Y. Wei, A. Hian-Cheong, K. Tian, J. Zhu, S. Zhang, T. Vaidya, and E. Asgarli. Helping Users Understand Their Web Footprints. International Conference on World Wide Web - Companion Proceedings, World Wide Web (WWW), Florence, Italy. Poster Paper, 2015.
W. B. Moore, Y. Wei, A. Orshefsky, M. Sherr, L. Singh, and H. Yang. Understanding Site-Based Inference Potential for Identifying Hidden Attributes. International Conference on Privacy, Security, Risk and Trust, Alexandria, VA. IEEE Computer Society, 2013.
J. Ferro, L. Singh, and M. Sherr. Identifying individual vulnerability based on public data. International Conference on Privacy, Security and Trust, Tarragona, Catalonia, Spain. IEEE Computer Society, 2013.
F. Nagle, L. Singh, and A. Gkoulalas-Divanis. EWNI: Efficient Anonymization of Vulnerable Individuals in Social Networks. Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), Kuala Lumpur, Malaysia. Springer, 2012.
A. Ramachandran, L. Singh, E. Porter, and F. Nagle. Exploring re-identification risks in public domains. Conference on Privacy, Security and Trust (PST). IEEE Computer Society, 2012.
The Team & Support
• Faculty – Lisa Singh, Micah Sherr, Grace Hui Yang
• Students & Researchers – Rob Churchill, Kristen Skillman, Kevin Tian, Sicong Zhang, Yanan Zhu
• Alumni – Aditi Ramachandran, Frank Nagle, John Ferro, Yifang Wei, Brad Moore, Andrew Hian-Cheong, Janet Zhu
Support: National Science Foundation
5 Reasons to Join Our Program
1. We are research active and provide full financial RA support for PhD students for 5 years
2. We have full and partial scholarships for Master's students
3. Our courses span applied and theoretical areas of computer science, as well as interdisciplinary areas like data science
4. We have exceptional job placement in top tech firms, national labs, and government agencies
5. We have a strong community among students and faculty
Need more information?
Website: http://cs.georgetown.edu
Graduate Director: Lisa Singh (singh@cs.georgetown.edu)
Application deadline: April 1
0
3
6
9
12
15Inferencegain
Inferencegain
What can be inferred from the population
[Mooreetal2013]
LinkedIndataset91150publicprofiles12attributesperprofile
Web Footprinting
Experiments for Understanding Public Profiles
Aboutme - personal website hosting site Each user can make a custom
webpage about themselves Can list links to their social
media profiles on multiple websites
Using their API we collected 124497 peoples information -gt Ground Truth
21
Creating Web Footprints Using Google+ Foursquare LinkedIn Profiles
[Singhetal2015]
23
Synonyms can be found
Dbpedia
Synonyms
Meronym
24
Using an Ontology
25
Approximately8000attributeswerematchedupfromtheontology
Taking Control of Our Web Identity and Data
1 Keep your public profile professional2 Change all your social media account settings that have
personal information on them from public to private3 Choose your friends wisely ndash add them selectively4 Join groups related to your professional interests5 Make it difficult for automated tools to link your accounts
eg use different account user names share different information etc
6 Install ad blockers to reduce data about your click through habits
7 Set your browser to not accept cookies from sites that you have not visited before
The world around us
DATAFICATION
Data Ethics
bull Regulationndash We need to hold companies to higher standards
bull Data ethics standardsndash We need discussion debate and possibly a new discipline
bull Catalog of personal datandash Individuals should be able to see correct andor remove
data companies have about them
[Singh2016]
Final Thoughts
bull There is a cultural acceptance of sharing private data publicly bull This is a problem - I have shown you different techniques for
generating web footprints It is too easybull We need new ways to help users understand what data can be
determined about them and help them take control of their information
bull We need to pause and debate online privacy and ethical uses of large-scale human behavioral data
bull We need to develop guidelines and regulations that protect users
We need to take back control of our data
ReferencesJ Zhu S Zhang L Singh H Yang and M Sherr Generating Risk Reduction Recommendations to Decrease Identifiability of Public Online Profiles under submission
A Hian-Cheong L Singh M Sherr H Yang Semantics and Public Information Exposure Detection invited
L Singh H Yang M Sherr A Hian-Cheong K Tian J Zhu and S Zhang Public Information Exposure Detection Helping Users Understand Their Web Footprints International Conference on Advances in Social Networks Analysis and Mining (ASONAM) Paris France EEEACM 2015
L Singh H Yang M Sherr Y Wei A Hian-Cheong K Tian J Zhu S Zhang T Vaidya and E Asgarli Helping Users Understand Their Web Footprints International Conference on World Wide Web - Companion Proceedings World Wide Web (WWW) Florence Italy Poster Paper 2015
W B Moore Y Wei A Orshefsky M Sherr L Singh H Yang Understanding Site-Based Inference Potential for Identifying Hidden Attributes International Conference on Privacy Security Risk and Trust Alexandria VA IEEE Computer Society 2013
J Ferro L Singh M Sherr Identifying individual vulnerability based on public data International Conference on Privacy Security and Trust Tarragona Catalonia Spain IEEE Computer Society 2013
F Nagle L Singh and A Gkoulalas-Divanis EWNI Efficient Anonymization of Vulnerable Individuals in Social Networks Pacific Asian Conference on Knowledge Discovery and Data Mining (PAKDD) Kuala Lumpur Malaysia Springer 2012
A Ramachandran L Singh E Porter and F Nagle Exploring re-identification risks in public domains Conference on Privacy Security and Trust (PST) IEEE Computer Society 2012
The Team amp Support
bull Faculty ndash Lisa Singh Micah Sherr Grace Hui Yan
bull Students amp Researchers ndash Rob Churchill Kristen Skillman Kevin Tian Sicong Zhang
Yanan Zhu
bull Alumni ndash Aditi Ramachandran Frank Nagle John Ferro Yifang Wei Brad
Moore Andrew Hian-Cheong Janet Zhu
Support National Science Foundation
5 Reasons to Join Our Program
1 WeareresearchactiveandprovidefullfinancialRAsupportforPhDstudentsfor5year
2 WehavefullandpartialscholarshipsforMasterrsquos students3 Ourcoursesspanappliedandtheoreticalareasofcomputerscienceaswell
asinterdisciplinaryareaslikedatascience4 Wehaveexceptionaljobplacementintoptechfirmsnationallabsand
governmentagencies5 Wehaveastrongcommunityamongstudentsandfaculty
NeedmoreinformationWebsite httpcsgeorgetowneduGraduateDirectorLisaSingh(singhcsgeorgetownedu)
ApplicationdeadlineApril1
Web Footprinting
Experiments for Understanding Public Profiles
Aboutme - personal website hosting site Each user can make a custom
webpage about themselves Can list links to their social
media profiles on multiple websites
Using their API we collected 124497 peoples information -gt Ground Truth
21
Creating Web Footprints Using Google+ Foursquare LinkedIn Profiles
[Singhetal2015]
23
Synonyms can be found
Dbpedia
Synonyms
Meronym
24
Using an Ontology
25
Approximately8000attributeswerematchedupfromtheontology
Taking Control of Our Web Identity and Data
1 Keep your public profile professional2 Change all your social media account settings that have
personal information on them from public to private3 Choose your friends wisely ndash add them selectively4 Join groups related to your professional interests5 Make it difficult for automated tools to link your accounts
eg use different account user names share different information etc
6 Install ad blockers to reduce data about your click through habits
7 Set your browser to not accept cookies from sites that you have not visited before
The world around us
DATAFICATION
Data Ethics
bull Regulationndash We need to hold companies to higher standards
bull Data ethics standardsndash We need discussion debate and possibly a new discipline
bull Catalog of personal datandash Individuals should be able to see correct andor remove
data companies have about them
[Singh2016]
Final Thoughts
bull There is a cultural acceptance of sharing private data publicly bull This is a problem - I have shown you different techniques for
generating web footprints It is too easybull We need new ways to help users understand what data can be
determined about them and help them take control of their information
bull We need to pause and debate online privacy and ethical uses of large-scale human behavioral data
bull We need to develop guidelines and regulations that protect users
We need to take back control of our data
ReferencesJ Zhu S Zhang L Singh H Yang and M Sherr Generating Risk Reduction Recommendations to Decrease Identifiability of Public Online Profiles under submission
A Hian-Cheong L Singh M Sherr H Yang Semantics and Public Information Exposure Detection invited
L Singh H Yang M Sherr A Hian-Cheong K Tian J Zhu and S Zhang Public Information Exposure Detection Helping Users Understand Their Web Footprints International Conference on Advances in Social Networks Analysis and Mining (ASONAM) Paris France EEEACM 2015
L Singh H Yang M Sherr Y Wei A Hian-Cheong K Tian J Zhu S Zhang T Vaidya and E Asgarli Helping Users Understand Their Web Footprints International Conference on World Wide Web - Companion Proceedings World Wide Web (WWW) Florence Italy Poster Paper 2015
W B Moore Y Wei A Orshefsky M Sherr L Singh H Yang Understanding Site-Based Inference Potential for Identifying Hidden Attributes International Conference on Privacy Security Risk and Trust Alexandria VA IEEE Computer Society 2013
J Ferro L Singh M Sherr Identifying individual vulnerability based on public data International Conference on Privacy Security and Trust Tarragona Catalonia Spain IEEE Computer Society 2013
F Nagle L Singh and A Gkoulalas-Divanis EWNI Efficient Anonymization of Vulnerable Individuals in Social Networks Pacific Asian Conference on Knowledge Discovery and Data Mining (PAKDD) Kuala Lumpur Malaysia Springer 2012
A Ramachandran L Singh E Porter and F Nagle Exploring re-identification risks in public domains Conference on Privacy Security and Trust (PST) IEEE Computer Society 2012
The Team amp Support
bull Faculty ndash Lisa Singh Micah Sherr Grace Hui Yan
bull Students amp Researchers ndash Rob Churchill Kristen Skillman Kevin Tian Sicong Zhang
Yanan Zhu
bull Alumni ndash Aditi Ramachandran Frank Nagle John Ferro Yifang Wei Brad
Moore Andrew Hian-Cheong Janet Zhu
Support National Science Foundation
5 Reasons to Join Our Program
1 WeareresearchactiveandprovidefullfinancialRAsupportforPhDstudentsfor5year
2 WehavefullandpartialscholarshipsforMasterrsquos students3 Ourcoursesspanappliedandtheoreticalareasofcomputerscienceaswell
asinterdisciplinaryareaslikedatascience4 Wehaveexceptionaljobplacementintoptechfirmsnationallabsand
governmentagencies5 Wehaveastrongcommunityamongstudentsandfaculty
NeedmoreinformationWebsite httpcsgeorgetowneduGraduateDirectorLisaSingh(singhcsgeorgetownedu)
ApplicationdeadlineApril1
Experiments for Understanding Public Profiles
Aboutme - personal website hosting site Each user can make a custom
webpage about themselves Can list links to their social
media profiles on multiple websites
Using their API we collected 124497 peoples information -gt Ground Truth
21
Creating Web Footprints Using Google+ Foursquare LinkedIn Profiles
[Singhetal2015]
23
Synonyms can be found
Dbpedia
Synonyms
Meronym
24
Using an Ontology
25
Approximately8000attributeswerematchedupfromtheontology
Taking Control of Our Web Identity and Data
1 Keep your public profile professional2 Change all your social media account settings that have
personal information on them from public to private3 Choose your friends wisely ndash add them selectively4 Join groups related to your professional interests5 Make it difficult for automated tools to link your accounts
eg use different account user names share different information etc
6 Install ad blockers to reduce data about your click through habits
7 Set your browser to not accept cookies from sites that you have not visited before
The world around us
DATAFICATION
Data Ethics
bull Regulationndash We need to hold companies to higher standards
bull Data ethics standardsndash We need discussion debate and possibly a new discipline
bull Catalog of personal datandash Individuals should be able to see correct andor remove
data companies have about them
[Singh2016]
Final Thoughts
bull There is a cultural acceptance of sharing private data publicly bull This is a problem - I have shown you different techniques for
generating web footprints It is too easybull We need new ways to help users understand what data can be
determined about them and help them take control of their information
bull We need to pause and debate online privacy and ethical uses of large-scale human behavioral data
bull We need to develop guidelines and regulations that protect users
We need to take back control of our data
ReferencesJ Zhu S Zhang L Singh H Yang and M Sherr Generating Risk Reduction Recommendations to Decrease Identifiability of Public Online Profiles under submission
A Hian-Cheong L Singh M Sherr H Yang Semantics and Public Information Exposure Detection invited
L Singh H Yang M Sherr A Hian-Cheong K Tian J Zhu and S Zhang Public Information Exposure Detection Helping Users Understand Their Web Footprints International Conference on Advances in Social Networks Analysis and Mining (ASONAM) Paris France EEEACM 2015
L Singh H Yang M Sherr Y Wei A Hian-Cheong K Tian J Zhu S Zhang T Vaidya and E Asgarli Helping Users Understand Their Web Footprints International Conference on World Wide Web - Companion Proceedings World Wide Web (WWW) Florence Italy Poster Paper 2015
W B Moore Y Wei A Orshefsky M Sherr L Singh H Yang Understanding Site-Based Inference Potential for Identifying Hidden Attributes International Conference on Privacy Security Risk and Trust Alexandria VA IEEE Computer Society 2013
J Ferro L Singh M Sherr Identifying individual vulnerability based on public data International Conference on Privacy Security and Trust Tarragona Catalonia Spain IEEE Computer Society 2013
F Nagle L Singh and A Gkoulalas-Divanis EWNI Efficient Anonymization of Vulnerable Individuals in Social Networks Pacific Asian Conference on Knowledge Discovery and Data Mining (PAKDD) Kuala Lumpur Malaysia Springer 2012
A Ramachandran L Singh E Porter and F Nagle Exploring re-identification risks in public domains Conference on Privacy Security and Trust (PST) IEEE Computer Society 2012
The Team amp Support
bull Faculty ndash Lisa Singh Micah Sherr Grace Hui Yan
bull Students amp Researchers ndash Rob Churchill Kristen Skillman Kevin Tian Sicong Zhang
Yanan Zhu
bull Alumni ndash Aditi Ramachandran Frank Nagle John Ferro Yifang Wei Brad
Moore Andrew Hian-Cheong Janet Zhu
Support National Science Foundation
5 Reasons to Join Our Program
1 WeareresearchactiveandprovidefullfinancialRAsupportforPhDstudentsfor5year
2 WehavefullandpartialscholarshipsforMasterrsquos students3 Ourcoursesspanappliedandtheoreticalareasofcomputerscienceaswell
asinterdisciplinaryareaslikedatascience4 Wehaveexceptionaljobplacementintoptechfirmsnationallabsand
governmentagencies5 Wehaveastrongcommunityamongstudentsandfaculty
NeedmoreinformationWebsite httpcsgeorgetowneduGraduateDirectorLisaSingh(singhcsgeorgetownedu)
ApplicationdeadlineApril1
Creating Web Footprints Using Google+ Foursquare LinkedIn Profiles
[Singhetal2015]
23
Synonyms can be found
Dbpedia
Synonyms
Meronym
24
Using an Ontology
25
Approximately8000attributeswerematchedupfromtheontology
Taking Control of Our Web Identity and Data
1 Keep your public profile professional2 Change all your social media account settings that have
personal information on them from public to private3 Choose your friends wisely ndash add them selectively4 Join groups related to your professional interests5 Make it difficult for automated tools to link your accounts
eg use different account user names share different information etc
6 Install ad blockers to reduce data about your click through habits
7 Set your browser to not accept cookies from sites that you have not visited before
The world around us
DATAFICATION
Data Ethics
bull Regulationndash We need to hold companies to higher standards
bull Data ethics standardsndash We need discussion debate and possibly a new discipline
bull Catalog of personal datandash Individuals should be able to see correct andor remove
data companies have about them
[Singh2016]
Final Thoughts
bull There is a cultural acceptance of sharing private data publicly bull This is a problem - I have shown you different techniques for
generating web footprints It is too easybull We need new ways to help users understand what data can be
determined about them and help them take control of their information
bull We need to pause and debate online privacy and ethical uses of large-scale human behavioral data
bull We need to develop guidelines and regulations that protect users
We need to take back control of our data
ReferencesJ Zhu S Zhang L Singh H Yang and M Sherr Generating Risk Reduction Recommendations to Decrease Identifiability of Public Online Profiles under submission
A Hian-Cheong L Singh M Sherr H Yang Semantics and Public Information Exposure Detection invited
L Singh H Yang M Sherr A Hian-Cheong K Tian J Zhu and S Zhang Public Information Exposure Detection Helping Users Understand Their Web Footprints International Conference on Advances in Social Networks Analysis and Mining (ASONAM) Paris France EEEACM 2015
L Singh H Yang M Sherr Y Wei A Hian-Cheong K Tian J Zhu S Zhang T Vaidya and E Asgarli Helping Users Understand Their Web Footprints International Conference on World Wide Web - Companion Proceedings World Wide Web (WWW) Florence Italy Poster Paper 2015
W B Moore Y Wei A Orshefsky M Sherr L Singh H Yang Understanding Site-Based Inference Potential for Identifying Hidden Attributes International Conference on Privacy Security Risk and Trust Alexandria VA IEEE Computer Society 2013
J Ferro L Singh M Sherr Identifying individual vulnerability based on public data International Conference on Privacy Security and Trust Tarragona Catalonia Spain IEEE Computer Society 2013
F Nagle L Singh and A Gkoulalas-Divanis EWNI Efficient Anonymization of Vulnerable Individuals in Social Networks Pacific Asian Conference on Knowledge Discovery and Data Mining (PAKDD) Kuala Lumpur Malaysia Springer 2012
A Ramachandran L Singh E Porter and F Nagle Exploring re-identification risks in public domains Conference on Privacy Security and Trust (PST) IEEE Computer Society 2012
The Team amp Support
bull Faculty ndash Lisa Singh Micah Sherr Grace Hui Yan
bull Students amp Researchers ndash Rob Churchill Kristen Skillman Kevin Tian Sicong Zhang
Yanan Zhu
bull Alumni ndash Aditi Ramachandran Frank Nagle John Ferro Yifang Wei Brad
Moore Andrew Hian-Cheong Janet Zhu
Support National Science Foundation
5 Reasons to Join Our Program
1 WeareresearchactiveandprovidefullfinancialRAsupportforPhDstudentsfor5year
2 WehavefullandpartialscholarshipsforMasterrsquos students3 Ourcoursesspanappliedandtheoreticalareasofcomputerscienceaswell
asinterdisciplinaryareaslikedatascience4 Wehaveexceptionaljobplacementintoptechfirmsnationallabsand
governmentagencies5 Wehaveastrongcommunityamongstudentsandfaculty
NeedmoreinformationWebsite httpcsgeorgetowneduGraduateDirectorLisaSingh(singhcsgeorgetownedu)
ApplicationdeadlineApril1
23
Synonyms can be found
Dbpedia
Synonyms
Meronym
24
Using an Ontology
25
Approximately8000attributeswerematchedupfromtheontology
Taking Control of Our Web Identity and Data
1 Keep your public profile professional2 Change all your social media account settings that have
personal information on them from public to private3 Choose your friends wisely ndash add them selectively4 Join groups related to your professional interests5 Make it difficult for automated tools to link your accounts
eg use different account user names share different information etc
6 Install ad blockers to reduce data about your click through habits
7 Set your browser to not accept cookies from sites that you have not visited before
The world around us
DATAFICATION
Data Ethics
bull Regulationndash We need to hold companies to higher standards
bull Data ethics standardsndash We need discussion debate and possibly a new discipline
bull Catalog of personal datandash Individuals should be able to see correct andor remove
data companies have about them
[Singh2016]
Final Thoughts
• There is a cultural acceptance of sharing private data publicly
• This is a problem - I have shown you different techniques for generating web footprints. It is too easy.
• We need new ways to help users understand what data can be determined about them and help them take control of their information
• We need to pause and debate online privacy and ethical uses of large-scale human behavioral data
• We need to develop guidelines and regulations that protect users
We need to take back control of our data
References
J. Zhu, S. Zhang, L. Singh, H. Yang, and M. Sherr. Generating Risk Reduction Recommendations to Decrease Identifiability of Public Online Profiles. Under submission.
A. Hian-Cheong, L. Singh, M. Sherr, H. Yang. Semantics and Public Information Exposure Detection. Invited.
L. Singh, H. Yang, M. Sherr, A. Hian-Cheong, K. Tian, J. Zhu, and S. Zhang. Public Information Exposure Detection: Helping Users Understand Their Web Footprints. International Conference on Advances in Social Networks Analysis and Mining (ASONAM), Paris, France, IEEE/ACM, 2015.
L. Singh, H. Yang, M. Sherr, Y. Wei, A. Hian-Cheong, K. Tian, J. Zhu, S. Zhang, T. Vaidya, and E. Asgarli. Helping Users Understand Their Web Footprints. International Conference on World Wide Web - Companion Proceedings, World Wide Web (WWW), Florence, Italy, Poster Paper, 2015.
W. B. Moore, Y. Wei, A. Orshefsky, M. Sherr, L. Singh, H. Yang. Understanding Site-Based Inference Potential for Identifying Hidden Attributes. International Conference on Privacy, Security, Risk and Trust, Alexandria, VA, IEEE Computer Society, 2013.
J. Ferro, L. Singh, M. Sherr. Identifying individual vulnerability based on public data. International Conference on Privacy, Security and Trust, Tarragona, Catalonia, Spain, IEEE Computer Society, 2013.
F. Nagle, L. Singh, and A. Gkoulalas-Divanis. EWNI: Efficient Anonymization of Vulnerable Individuals in Social Networks. Pacific Asian Conference on Knowledge Discovery and Data Mining (PAKDD), Kuala Lumpur, Malaysia, Springer, 2012.
A. Ramachandran, L. Singh, E. Porter, and F. Nagle. Exploring re-identification risks in public domains. Conference on Privacy, Security and Trust (PST), IEEE Computer Society, 2012.
The Team & Support
• Faculty – Lisa Singh, Micah Sherr, Grace Hui Yang
• Students & Researchers – Rob Churchill, Kristen Skillman, Kevin Tian, Sicong Zhang, Yanan Zhu
• Alumni – Aditi Ramachandran, Frank Nagle, John Ferro, Yifang Wei, Brad Moore, Andrew Hian-Cheong, Janet Zhu
Support: National Science Foundation
5 Reasons to Join Our Program
1. We are research active and provide full financial RA support for PhD students for 5 years
2. We have full and partial scholarships for Master's students
3. Our courses span applied and theoretical areas of computer science as well as interdisciplinary areas like data science
4. We have exceptional job placement in top tech firms, national labs, and government agencies
5. We have a strong community among students and faculty
Need more information?
Website: http://cs.georgetown.edu
Graduate Director: Lisa Singh (singh@cs.georgetown.edu)
Application deadline: April 1