
Crowdsourced Exploration of Security Configurations

Qatrunnada Ismail† Tousif Ahmed† Apu Kapadia† Michael K. Reiter‡
†School of Informatics and Computing ‡Department of Computer Science
Indiana University Bloomington, Bloomington, IN, USA
University of North Carolina, Chapel Hill, NC, USA
{qismail, touahmed, kapadia}@indiana.edu, [email protected]

ABSTRACT
Smartphone apps today request permission to access a multitude of sensitive resources, which users must accept completely during installation (e.g., on Android) or selectively configure after installation (e.g., on iOS, but also planned for Android). Everyday users, however, do not have the ability to make informed decisions about which permissions are essential for their usage. For enhanced privacy, we seek to leverage crowdsourcing to find minimal sets of permissions that will preserve the usability of the app for diverse users.

We advocate an efficient ‘lattice-based’ crowd-management strategy to explore the space of permission sets. We conducted a user study (N = 26) in which participants explored different permission sets for the popular Instagram app. This study validates our efficient crowd-management strategy and shows that usability scores for diverse users can be predicted accurately, enabling suitable recommendations.

Author Keywords
Android apps; crowdsourcing; permissions; privacy

ACM Classification Keywords
H.5.m Information interfaces and presentation (e.g., HCI): Miscellaneous; D.4.6 Software: Security and Protection—Access Control

INTRODUCTION
Users of mobile devices such as smartphones and tablets are able to install apps from online marketplaces that feature millions of apps.1 These apps can potentially access various sensors (e.g., accelerometer, microphone, GPS, and camera) and user data (e.g., contact information and photos). In general, it is difficult for the operating system (such as Android or iOS) to ascertain whether an app is making legitimate or nefarious uses of the sensors or data. Instead, during the installation

1 http://www.appbrain.com/stats/number-of-android-apps/, http://toucharcade.com/2014/06/02/new-ios-8-app-store-features/

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Copyright is held by the owner/author(s).
CHI 2015, Apr 18–23, 2015, Seoul, Republic of Korea
ACM 978-1-4503-3145-6/15/04. http://dx.doi.org/10.1145/2702123.2702370

of an app, the Android operating system displays all the requested permissions, which the user either approves or denies, as shown in Figure 1a. On iOS, users initially install apps and then are able to selectively enable or disable privacy-sensitive permissions, as shown in Figure 1b. Google has been experimenting with a similar feature (“App Ops”) that allows users to selectively disable specific permissions, but it has not yet been released officially.2

(a) Android

(b) iOS

Figure 1. Control panels related to application permissions.

Ideally users would determine the fewest permissions needed for an app to support their regular usage of the app (i.e., the ‘principle of least privilege’ [14]). For example, Angry Birds requests access to location, but may work well for most users if access to location is disabled. While several research proposals have been made to allow more granular control or detect anomalous uses of such sensors (Bugiel et al. [2] provide a comprehensive overview of related work on Android OS security), it remains unclear how users can decide which permissions are useful for the functioning of the app. Felt et al. [4] proposed an automated technique to determine which permissions are not used by an app, but they did not address the relative importance of each permission that is used by the app. While users can provide feedback through ratings and comments on app marketplaces like Apple’s App Store and the Google Play Store, no feedback or ratings are available on various configurations (i.e., with various combinations of

2 https://www.eff.org/deeplinks/2013/12/google-removes-vital-privacy-features-android-shortly-after-adding-them/


permissions disabled) of the app. Indeed, there are too many possible configurations of an app, and it is not clear if it is feasible to provide meaningful feedback about all configurations.

We advocate a crowdsourced approach that leverages humans to efficiently assess various security configurations for an app and identify app configurations that strike tradeoffs between privacy and usability suitable to each user. After gaining familiarity with an app, each crowd member would volunteer to utilize a small number of configurations for that app over a test period. Based on the usability scores of configurations given by a specific user, similar crowd members would be identified and then used to recommend configurations suitable to that user. We envision that our system can be easily integrated into popular app marketplaces such as the Apple App Store and the Google Play Store. Instead of rating only one configuration of an app, smartphone users would test and rate different configurations of apps. Then our system would recommend appropriate configurations to users based on their similarities to other users in the marketplace, i.e., the crowd.

The proposed approach, however, poses challenges. First, as the number n of permissions requested by an application increases, the number of possible permission sets increases exponentially (2^n). Thus a scalable approach is needed to collaboratively explore the permission space. Second, even if these configurations can be collaboratively explored, it is not clear whether suitable recommendations can be made to a diverse set of users. A permission set that is acceptable to some users may not be acceptable to others. Finally, unobtrusively deploying and assessing configurations would benefit from measures of ‘usability’ that do not involve user intervention.

Research questions. This research seeks to leverage crowdsourcing to efficiently find a set of permissions for each user installing an app that strikes an acceptable balance between app usability and user privacy (by finding the permissions that are infrequently used by users or permissions that users do not mind losing). More specifically, we seek to answer the following research questions:

R1: Can we use crowdsourcing scalably to explore security configurations of an app? Using crowdsourcing, we seek to obtain users’ perspectives on the usability of the app when certain combinations of permissions are disabled. Based on the intuition that usability scores cannot generally increase when additional permissions are removed, we develop a ‘lattice-based’ approach to prune the search space of permission sets and evaluate its efficacy.

R2: Can we recommend suitable permission sets based on the crowd’s ratings? We would like to predict the suitability of various app configurations for users based on ratings obtained from the crowd (other users). In particular, we explore whether collaborative filtering can be used to make such predictions effectively.

R3: What proxies can be used to determine the usability of an app without asking users for ratings explicitly? We seek to find if we can ascertain the ‘usability’ of an app in automated ways, such as measuring the time spent by a user using an app or logging any problems s/he encounters. Automated measures require fewer user interruptions and so could permit a less obtrusive exploration of the state space.

Our contributions. Before the suggested crowdsourcing strategy can be applied in general, we seek to validate our lattice-based approach and the viability of making accurate predictions of usability scores. As a first step, we perform a user study (N = 26) on one specific, but popular, app (Instagram). Participants tried different configurations of this app in which each configuration had specific permissions removed. Participants reported their feedback about each configuration and answered exit surveys at the end of the study.

We found that: 1) as more permissions were removed, the reported usability scores dropped (or did not increase) significantly, validating the lattice-based approach; 2) based on usability scores given by the crowd, we can predict participants’ individual ratings for various subsets of permissions, and thus make suitable recommendations; and 3) to measure usability, instead of asking users for explicit feedback, other attributes such as the usage time of an app and the number of problems encountered are suitable proxies for usability.

RELATED WORK

Crowdsourcing
Crowdsourcing has been used in many studies related to Android privacy and security. Amini et al. [1] developed AppScanner, which analyzes and learns application behaviors to find privacy-related issues. In their crowdsourcing step, they ask users for their reactions to behaviors uncovered by their tool. Crowdsourcing was also used by Lin et al. [10] to capture users’ expectations about privacy-related behaviors of Android applications. They used their findings to design a technique to inform Android users about the way permissions are used by an application. However, it may not be clear to users how removing certain permissions may affect the usability of the app. In our crowdsourcing step, instead of only asking users about their expectations, we provide them with different configurations of applications (with different permissions removed) and get users’ feedback based on their experience with those configurations.

Studying the Effects of Permission Removal
Since users do not have fine-grained control over what specific permissions they grant apps in Android, researchers have proposed systems to give users the ability to do so. For example, Nauman et al. [12] modified Android with an advanced application installer that enables users to allow or deny each individual permission requested at installation. Rather than asking users to evaluate permissions at installation, we instead ask them to evaluate an application with selected permissions disabled. We then use their feedback as a basis for performing collaborative filtering to benefit other users. Moreover, our method of disabling permissions does not modify Android.

There are also studies about the effects on applications’ performance after removing permissions. Notably, Kennedy et al. [9] proposed a system that removes permissions from Android applications to determine which removed permissions cause the app to crash. Their system, however, focuses on an automated process for evaluating the run-time effects of removing permissions and does not assess the impact on users based on real-world usage, as we do.

App Ops, a feature in Android v4.3, allowed users to selectively disable permissions for apps on their phones. However, Google removed this feature in the next update, reporting that it was experimental and could cause apps to behave in unexpected ways. For our approach, we assume the apps being explored have reasonable exception handling. We expect that App Ops-like functionality will eventually be released for Android (as it is already available in iOS) and that most apps will handle permission restrictions gracefully.

Users’ Perspectives about Permissions
Felt et al. [5, 6] and Kelly et al. [7] showed that the current way of displaying permissions is not clear to users and not effective at informing them about the potential risks of installing an app. Kelly et al. [8] thus proposed a more effective way to display the permissions and found it to be helpful for users to make better decisions with respect to their privacy. Rather than asking users directly if they think a specific permission is needed, we let users test app configurations with some permissions removed and report their feedback. We expect a combination of these approaches would be useful, whereby useful suggestions through crowdsourcing can improve information provided in the installation interface.

Collaborative Filtering
Recommender systems are used to predict ratings and to create personal recommendations in various domains. By applying statistical methods and knowledge discovery techniques, it is possible to recommend products based on user preferences [15]. Collaborative filtering is a recommendation technique that is used for predicting a user’s ratings for unknown products based on the user’s previous ratings and by aggregating other (similar) users’ ratings. Two common techniques for collaborative filtering are ‘user-based’ collaborative filtering and ‘item-based’ collaborative filtering. In user-based collaborative filtering, a user’s rating for any particular item is predicted from ratings by similar users for that item [13]. In item-based collaborative filtering, ratings for an item are predicted from ratings for similar items [16]. In this paper, we incorporate the idea of user-based collaborative filtering into the mobile app privacy domain and use it for predicting the suitability of various permission sets.

APPROACH
In this section, we describe our proposed crowdsourcing approach, usability metrics, and recommender system. Moreover, we present the hypotheses that we test in our user study.

Lattice-Based Approach for Crowd Management
We propose an approach whereby crowd members strategically explore the various permission sets of an app in a way that is scalable with the number of permissions requested by the app. Our main hypothesis here is that as the set of permissions removed grows, the ‘usability’ of the app does not increase. (We defer the discussion about how exactly we measure ‘usability’ to the next subsection.) Intuitively, if a set of permissions is made more restrictive, we expect the number of features of the app to also be reduced and thus result in a lower or equal rating.3 For example, if a user thinks that a specific app is unusable when the location permission is removed, s/he is likely to think the same (or worse) if the location and camera permissions are removed together. This intuition suggests an approach of a structured exploration of the permission space, as we formalize next.

The permission space can be represented by a lattice structure. Consider a lattice structure as shown in Figure 2 for an app that requests four permissions. The nodes in the lattice represent the sets of removed permissions. The null vertex represents the original app where there are no permissions removed, so it is not shown in the figure. Each level in the lattice going upwards represents an increasing number of permissions removed. So, the first level represents nodes with one permission removed, the second level has nodes with two permissions removed, and so on.
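To make the structure concrete, the lattice of removed-permission sets can be enumerated level by level. The following is a minimal Python sketch of our own (not from the paper), with permissions represented as integers as in Figure 2:

```python
from itertools import combinations

def lattice_levels(permissions):
    """Enumerate the lattice of removed-permission sets, level by level.

    Level k (k = 1 .. n) contains every subset of size k; the empty set
    (the original, unmodified app) is omitted, as in the paper's figure.
    """
    return [
        [frozenset(c) for c in combinations(permissions, k)]
        for k in range(1, len(permissions) + 1)
    ]

levels = lattice_levels([1, 2, 3, 4])
print([len(level) for level in levels])  # → [4, 6, 4, 1]
```

For four permissions there are 15 non-empty nodes in total, matching the 15 configurations explored in our study.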

Figure 2. The lattice structure where each node represents specific permissions that are removed. (The depicted example uses a pruning threshold of 4, with nodes pruned from the first and second levels.)

We suggest the following crowd-based exploration strategy: First, the crowd explores the first level in the lattice and the average usability score is calculated for each node. If the scores of any of these nodes fall below a certain threshold, their ancestors can be pruned and not explored, because the scores are expected to be non-increasing. For example, if we consider the lattice in Figure 2, after testing all four nodes in the first level, the node {4} has a score below the threshold. So, we prune all of this node’s ancestors. This leaves us with fewer nodes to test as we proceed upwards. After testing the nodes {1,2}, {1,3}, and {2,3}, we may find that the average score for {1,3} falls below the threshold, so we prune its ancestor {1,2,3}. This approach enables us to explore the permission space efficiently and find suitable configurations of permissions based on the crowd’s reports, i.e., nodes {1,2} and {2,3} in our example. This technique can be more efficient than exploring the whole space, especially when the number n of permissions is large.

3 Of course, this is not true as a rule. For example, removing one permission alone might cause the app to consistently crash, though in addition removing a second permission might prevent the app from ever attempting to use the first (and so from crashing). Though an artificial example, it shows that removing permissions need not monotonically decay the usability of the app. That said, recall that we assume apps will handle the removal of permissions gracefully; presumably this should already be true for iOS apps and will increasingly become so for Android apps once App Ops is (re)released.
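The pruning strategy described above can be sketched as follows. This is an illustrative sketch of ours, not the authors' code; `avg_score` stands in for the crowd's average rating of a node, and the specific score values are illustrative choices consistent with the walk-through in the text (threshold = 4):

```python
from itertools import combinations

def explore(permissions, avg_score, threshold):
    """Level-by-level lattice exploration with ancestor pruning.

    avg_score: callable returning the crowd's average usability score
    for a frozenset of removed permissions (a hypothetical oracle).
    Every superset (ancestor) of a below-threshold node is skipped.
    """
    pruned = []      # below-threshold nodes; their supersets are never tested
    suitable = []
    for k in range(1, len(permissions) + 1):
        for combo in combinations(permissions, k):
            node = frozenset(combo)
            if any(p <= node for p in pruned):
                continue                      # ancestor of a pruned node
            if avg_score(node) < threshold:
                pruned.append(node)
            else:
                suitable.append(node)
    return suitable

# Illustrative scores consistent with the example in the text:
scores = {frozenset({1}): 4.5, frozenset({2}): 5.0, frozenset({3}): 4.2,
          frozenset({4}): 3.0, frozenset({1, 2}): 4.2,
          frozenset({1, 3}): 2.5, frozenset({2, 3}): 4.0}
result = explore([1, 2, 3, 4], scores.__getitem__, 4)
```

With these scores, only 7 of the 15 nodes are ever tested, and {1,2} and {2,3} survive at level 2, as in the example.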

Before this strategy can be applied in practice, it is important to ascertain whether the presumed lattice-based relationship of usability scores holds. As a first step, we evaluate the following hypothesis for a specific application (Instagram):

H1: The usability scores of the nodes in the lattice are non-increasing as we proceed upwards in the lattice and remove more permissions.

Measuring Usability
In this paper we use the term “usability” to refer to how ‘acceptable’ users consider an application to be when some of its advertised functionality has been intentionally restricted. We obtain this ‘ground truth’ information by asking participants about their level of acceptance of using a configuration of the app (until the next major app update) with specific features disabled. We also ask them to rate the different configurations compared to the original app in which no permissions are removed. In our envisioned crowdsourcing system, we seek to measure the usability of an app indirectly, without user intervention, by using indirect measures that can serve as a proxy for usability. Those measures include the time spent using an app and whether there are problems found while using it.

Therefore, we have the hypotheses mentioned below. We also test whether the ‘ratings’ provided by users agree with the ‘acceptability’ scores (lending validity to using ‘acceptability’ as a ground-truth measure):

H2: Under the assumption that the original app is perceived as usable by users, a user’s rating of a customized configuration of the app as better or worse than the original is a good predictor for whether the user finds the customized configuration acceptable.

H3: The time spent using a customized configuration of the app can be used as a proxy to measure its usability. This means that participants who use a particular configuration longer than another configuration find the former configuration more acceptable.

H4: Whether users find problems in a customized configuration of an app can be used as a proxy to measure its usability. Participants who identify more problems in a configuration of an app will rate it as less usable than a configuration of an app where fewer problems are encountered.

Recommender System
Our goal is to leverage the crowd-based exploration of the permission subsets to recommend a suitable one for a particular user. We advocate a ‘user-based collaborative filtering’ approach [13] to recommend permission subsets to users based on their similarity (of ratings) to other users in the crowd.

The user-based collaborative filtering computes the predicted ratings for user u ∈ U on all ‘items’ j ∈ I, where the items represent permission subsets that u did not try. The algorithm first identifies the most similar users to u based on similarity in their ratings for the common items that they have tried. Those users are denoted the k-nearest neighbors, where k is the number of those similar users. The predicted rating for user u on any item j ∈ I is then computed as [3]:

p_{u,j} = r̄_u + ( Σ_{v ∈ Neighbors(u)} s(u,v) · (r_{v,j} − r̄_v) ) / ( Σ_{v ∈ Neighbors(u)} |s(u,v)| )

where r̄_u is the mean rating for user u (across all items rated by that user), Neighbors(u) is the list of u’s similar users, r_{v,j} is the usability rating of user v for permission set j (a positive integer in our case), and s(u,v) is the ‘similarity’ between user u and his neighbor v. This formula calculates the difference between each neighbor v’s rating for item j and v’s mean rating for all items, takes the weighted mean of this difference across neighbors (using a similarity measure as the weight), and then applies this difference to u’s mean rating. We give an example in the Findings section (see Table 3).
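A direct transcription of this prediction formula, as an illustrative sketch (the tuple-based input format is our own convention, not the paper's):

```python
def predict_rating(u_mean, neighbors):
    """p_{u,j}: predicted rating of user u for item j.

    u_mean: r̄_u, user u's mean rating over the items s/he rated.
    neighbors: one (s(u, v), r_{v,j}, r̄_v) tuple per nearest
    neighbor v who rated item j.
    """
    num = sum(s * (r_vj - v_mean) for s, r_vj, v_mean in neighbors)
    den = sum(abs(s) for s, _, _ in neighbors)
    return u_mean if den == 0 else u_mean + num / den

# Two neighbors who each rated item j one point above their own mean
# shift u's prediction one point above u's mean:
print(predict_rating(4.0, [(0.9, 5.0, 4.0), (0.5, 3.0, 2.0)]))  # → 5.0
```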

The similarity measure s(u,v) can be computed, e.g., using the ‘cosine similarity’ of the users’ rating vectors, shown in the formula below, where r_u and r_v are users u’s and v’s rating vectors and unknown ratings are replaced by 0:

s(u,v) = (r_u · r_v) / (‖r_u‖ ‖r_v‖) = ( Σ_{i=1}^{n} r_{u,i} r_{v,i} ) / ( √(Σ_{i=1}^{n} r_{u,i}²) · √(Σ_{i=1}^{n} r_{v,i}²) )
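The cosine similarity computation is straightforward; a sketch of ours (not the authors' implementation), assuming unrated items have already been replaced by 0 in the vectors:

```python
import math

def cosine_similarity(ru, rv):
    """s(u, v): cosine of the angle between two rating vectors."""
    dot = sum(a * b for a, b in zip(ru, rv))
    norm = math.sqrt(sum(a * a for a in ru)) * math.sqrt(sum(b * b for b in rv))
    return 0.0 if norm == 0 else dot / norm

print(cosine_similarity([3, 4, 0], [3, 4, 0]))  # identical ratings → 1.0
```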

We propose a specialized approach based on the lattice structure. For example, we may want to recommend as restrictive a permission set as possible (higher up in the lattice, which means enhanced privacy) as long as the usability score does not fall below a certain threshold. Furthermore, the lattice-based structure can be leveraged to improve the accuracy of our predictions given the relationship between nodes in the lattice. As hypothesized in H1, as we remove more permissions, the usability scores of the apps are non-increasing. So, if the predicted score of any node in the lattice is greater than any of its children’s scores, we set the predicted score for that node to the minimum score of the children nodes. For example, as we can see from the lattice depicted in Figure 2, node {1, 2} has two children, {1} and {2}. If the predicted score for node {1, 2} is 5.5, our algorithm sets it to 4.5, which is the minimum of the children’s scores, {1} = 4.5 and {2} = 5. We refer to this approach as the “clamping technique”, since the recommended score is clamped to the highest value allowable by the user’s own scores in the lattice.
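The clamping step reduces to taking a minimum over the node's own predicted score and its children's scores; a minimal sketch of ours:

```python
def clamp_score(predicted, children_scores):
    """Enforce H1: a node's predicted score may not exceed any child's score.

    children_scores: usability scores of the node's immediate children
    (nodes with one fewer permission removed).
    """
    return min([predicted] + list(children_scores))

# The paper's example: node {1,2} predicted at 5.5, children {1}=4.5, {2}=5:
print(clamp_score(5.5, [4.5, 5.0]))  # → 4.5
```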

We evaluate how accurately the suggested collaborative filtering approach can be used to predict usability scores for users, and thus the viability of a crowdsourcing-based approach for identifying configurations of applications that balance usability and privacy for individual users.

STUDY METHODOLOGY

System Implementation
To answer our research questions, we conducted a user study in which we focused on Instagram, which is a popular social networking and photo/video sharing Android app. Throughout the paper, we refer to Instagram as the “test app”. In the study, participants installed an Android app called “the ratings app”, to provide them with different configurations of the test app and collect their feedback on each configuration.

Choice of the test app. The test app, Instagram, is a photo/video sharing and social networking app. We chose this specific app because it is one of the most popular free Android apps in the Google Play store. Besides being popular, the test app requests several permissions categorized as “dangerous” (according to Android’s classification),4 providing access to sensitive user data such as their location and the camera. As permissions in Android apps are divided into four different levels depending on their risk level, we focus on the permissions at the ‘dangerous’ level. Permissions in this category are displayed to users upon installation of an app and need to be approved to complete the installation process [5]. In our study, we focused on four such permissions providing access to: (1) location, (2) contacts, (3) camera, and (4) microphone.

Customization of the test app for exploration. We used the Android-apktool5 to unpack the test app’s Android application package file (apk file) and access the app’s manifest file. The manifest file includes the list of requested permissions. We then created different customized configurations of the test app by removing one or more permissions from the manifest file and then recompiling and repackaging the apk files. We did not have access to the test app’s code itself, and we made no modifications to the application code.

We created all possible configurations of the app by removing one or more of the four permissions listed above. This gave us a total of 15 customized configurations of the test app. The configurations are shown in Figure 4.


Figure 3. The ratings app homescreen

The ratings app. When participants installed the ratings app on their phones, they were prompted to install one of the customized configurations of the test app after they removed the original app. After installing a customized configuration of the test app, participants had two days to use it; a timer displayed the time left to request the next configuration of the test app, as shown in Figure 3. Also, participants were allowed to ask for a different configuration of the test app at any time before the timer ended, but with a $0.50 deduction from their overall compensation. Moreover, the amount of money earned in the study was displayed by the ratings app, as shown in Figure 3. As such, our compensation scheme (described in more detail in the Study Procedure section) motivated participants to think carefully before requesting a new configuration.

4 http://developer.android.com/guide/topics/manifest/permission-element.html
5 https://code.google.com/p/android-apktool/

Collecting participants’ feedback. At the end of the two-day period for exploring a configuration of the test app, participants received notifications on both their phones and email to remind them to visit the ratings app and request the next configuration of the test app. When participants requested a new configuration by clicking on the reset button, a survey dialog sought their feedback about the configuration that they just used. The survey questions are shown in the Appendix, which we summarize here:

Q1. We asked participants about whether they encounteredproblems in the configuration they just used.

Q2. We listed the disabled features (based on the removed permissions) and asked participants to rate, on a 7-point Likert scale, how acceptable they found this configuration (if they had to use it until the next major app update).

Q3. We asked participants to rate the test app’s configuration compared to the original one in which there were no permissions removed.

Q4. We asked participants to estimate the amount of timethey used that configuration of the test app.

After answering the questionnaire, participants were given a different configuration of the test app and the timer was reset.

Back-end. When participants asked for a different configuration of the test app, the ratings app connected to a remote server, where participants' answers to the questionnaire described above were recorded and a customized configuration of the test app was retrieved. One of the customized configurations was chosen at random while making sure that it had not previously been used by the requesting participant. We also set a threshold on the number of times that each configuration was tested, to make sure that we did not end up with some configurations used much more than others. Moreover, to make sure that participants actually used the provided configuration of the test app, the ratings app kept track of whenever participants used the test app on their phones. This information was recorded at the back-end.
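The back-end's assignment logic, as described above, can be sketched as follows. This is an illustrative reconstruction, not the study's actual server code; the function name `next_configuration` and the `cap` parameter are our own.

```python
import random

def next_configuration(all_configs, user_history, usage_counts, cap):
    """Pick an untested configuration for the requesting participant,
    skipping configurations that have already reached the testing cap."""
    candidates = [c for c in all_configs
                  if c not in user_history and usage_counts.get(c, 0) < cap]
    if not candidates:
        return None  # the participant has exhausted the available configurations
    choice = random.choice(candidates)
    usage_counts[choice] = usage_counts.get(choice, 0) + 1
    return choice
```

Configurations here would be sets of removed permissions (e.g., `frozenset({1, 2})`), with `usage_counts` shared across participants so that no single configuration is over-tested.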

Design Rationale

Minimizing learning effects. The collaborative filtering approach generates recommendations by analyzing multiple ratings from each participant. Thus we needed each participant to test multiple app configurations, as they would in a real system. To minimize learning effects due to repeated exploration by the same participant, we randomized the app configurations, and their order, tested by each participant. Thus participants could not predict whether subsequent configurations would be more (or less) restrictive, or which permissions would be removed.

Managing participant workload. In collaborative filtering, more ratings from a single user generally produce better recommendations for that user (e.g., getting better movie recommendations from Netflix as the viewer watches and rates more movies). We did not want to induce fatigue in our participants (e.g., by assigning too many configurations to each participant) and wanted to ensure that participants had enough time to explore each configuration to make an informed decision. Based on a pilot study with two undergraduate students, we determined that 5 configurations over 10 days was a reasonable workload to meet both goals, and we devised a compensation scheme, discussed in the next section, based on this workload. We note that in our exit survey, 20 out of 26 participants said they felt they had enough time to explore each of their assigned configurations.

Study Procedure

Enrollment. Participants were required to be at least 18 years old, to have lived in the United States for at least five years, to have an Android phone with a data plan, and to already be using Instagram. Qualified participants reported to our lab to enroll in the study. They were informed of the purpose and process of the study. After consenting, participants installed the ratings app. We provided two demo configurations of the test app, allowing participants to familiarize themselves with the ratings app and the study process. We then instructed the participants to contact us after completing the study.

Ethical considerations. Our study was IRB-approved, and Indiana University's General Counsel approved our modification of Instagram's manifest file based on a 'Fair Use' analysis of its license agreement.

End of study questionnaire. At the end of the study, participants were given a questionnaire that asked for their technical backgrounds, demographics, and feedback. It also included questions related to assessing privacy attitudes.

Compensation and incentive scheme. Study participants were paid up to $25 upon the completion of the study. The study was designed to last approximately 10 days, and participants earned $2 per day. However, this amount could be reduced in the following situations:

• If the participant was supposed to reset the test app at the end of the two-day period but did not do so for a whole day, s/he did not earn the $2 for that day. We used this as an incentive to motivate participants to continue trying different configurations of the test app throughout the study.

• If the participant reset the test app before the end of the two-day period in which s/he was to use the test app, $0.50 was deducted from his/her overall balance. We used this penalty to make participants think twice about whether the current configuration was tolerable. So, if a participant was trying to use a feature that had been disabled, whether s/he asked for a new configuration and lost $0.50 or waited depended on how important this feature was to him/her.

At the end of the study, if no money was deducted from the participant's balance, s/he earned $20. An extra $5 was added if s/he answered the end-of-study questionnaire, resulting in a maximum compensation of $25.

FINDINGS

Participants

We recruited our participants from Bloomington, IN, USA (a college town that is home to Indiana University) using flyers, online university classifieds advertisements, and student mailing lists. The study lasted for two months (Feb–Mar 2014). Overall, 30 participants enrolled in our study, 4 of

[Figure 4 appears here: a lattice of removed-permission sets, from the single permissions {1}, {2}, {3}, {4} at level 1 up to {1,2,3,4} at level 4, where 1: RECORD_AUDIO, 2: READ_CONTACTS, 3: ACCESS_FINE_LOCATION, 4: CAMERA. Each node shows its mean acceptance score (M), standard deviation (SD), and number of users who tested that configuration (#u). Per-level aggregates: Level 1 (M = 4.34, SD = 2.00, #u = 38), Level 2 (M = 3.20, SD = 1.67, #u = 53), Level 3 (M = 2.63, SD = 1.80, #u = 35), Level 4 (M = 2.00, SD = 1.41, #u = 11).]

Figure 4. The lattice structure where each node represents specific removed permissions. The red arrows indicate slight increases in scores which are not statistically significant.

which did not complete the study (testing fewer than 5 configurations), so we removed their ratings. Therefore, in our analysis we considered the 26 participants who finished the study, ending up with a total of 137 ratings. In our sample, most participants (22) tested 5 configurations of the test app, 3 participants tested 6 configurations, and 1 participant tested 9. The participant who tested 9 repeatedly received a configuration with the camera permission removed and thus requested more configurations than most. Each configuration of the test app was rated by at least 8 and at most 11 participants.

Among these 26 participants, 15 were female (57.7%) and 11 were male (42.3%). All but one were students. Seven (26.9%) were Informatics or Computer Science majors, while the rest came from diverse, non-technical majors.

Analysis of the Lattice Structure

Figure 4 shows the average acceptance score (M), standard deviation (SD), and number of participants (#u) per test app configuration, as well as aggregated per level in the lattice. Our main hypothesis about the acceptance scores (as a usability measure)—i.e., that they are generally not increasing as we remove more permissions—is evidenced in Figure 4.

Figure 5. Scatter plot showing the effects on acceptance scores as we remove more permissions.

Figure 5 also shows that as we remove more permissions, the acceptance scores tend to decrease or stay the same. The big circles in Figure 5 indicate whether the participant encountered a problem or not (Q1 in the survey).

To test our first hypothesis (H1), we considered using a multi-level model such as the linear mixed effects model, which takes into account the effects of having users contribute multiple data points. However, the homoscedasticity assumption is violated in our data because the variances of the acceptance scores of different permission subsets are not equal; we therefore used non-parametric tests. Our conclusions hold under both the linear mixed effects model and the non-parametric tests, but we report the latter because they are more conservative, making fewer assumptions.

We first ran a Kruskal-Wallis test, which is a non-parametric one-way (single factor) ANOVA. This test shows whether there is a statistically significant difference between variables, but it does not provide information about the nature of the difference (increase or decrease). To examine the difference, we ran a follow-up post-hoc test: the Wilcoxon rank-sum test, the non-parametric analogue of the t-test. Since this post-hoc test involves multiple comparisons, we used the Benjamini-Hochberg method to adjust the resulting p-values to control the false discovery rate. The result of the Kruskal-Wallis test (χ2 = 21.48, df = 3, p < 0.001) indicates that there is a significant difference in participants' acceptance scores due to the number of permissions removed. The results of the Wilcoxon rank-sum post-hoc test are shown in Table 1, where the statistic W and the p-value assess the significance of the difference (the decrease) between the variables (# of permissions removed).
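As a concrete illustration of the adjustment step, the Benjamini-Hochberg procedure can be implemented in a few lines. This is a generic sketch of the standard method, not the analysis scripts used in the study:

```python
def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values (controls the false
    discovery rate across a family of comparisons)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, taking a running minimum of
    # p * m / rank so that adjusted values remain monotone and <= 1.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

Each raw p-value is scaled by m/rank (rank taken in ascending order), and the running minimum enforces monotonicity of the adjusted values.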

Table 1 shows that as more permissions are removed, the acceptance scores decline significantly until 4 permissions are removed (at which point there is no significant difference in the scores compared to removing 3 permissions). This means that as we go from level 3 to level 4, the acceptance values are neither increasing nor decreasing significantly. One reason could be that we have few data points for level 4, which contains only one node compared to the other levels. Given that the significant difference between levels indicates a significant decrease, we can reject the null hypothesis and accept our first hypothesis (H1) that the acceptance scores are non-increasing as we remove more permissions.

Comparison of             W         p-value       Adjusted
# permissions removed                             p-value
1 vs. 2                 1336.0      0.009**       0.015*
1 vs. 3                  990.0    < 0.001***      0.002**
1 vs. 4                  349.0      0.001**       0.002**
2 vs. 3                 1143.0      0.024*        0.029*
2 vs. 4                  423.5      0.017*        0.025*
3 vs. 4                  233.5      0.275         0.275

Statistical significance: *** p < 0.001, ** p < 0.01, * p < 0.05

Table 1. Wilcoxon rank-sum results for the statistical significance of the decrease in acceptance scores as we remove more permissions.

In addition to the Kruskal-Wallis test, we performed a correlation analysis using Spearman's Rank Coefficient, which measures the tendency of two variables to increase or decrease together. While the p-value tells us the significance of the correlation between the variables, ρ tells us whether the variables are monotonically increasing or decreasing and the degree of monotonicity. Testing the acceptance score versus the number of permissions removed, we found a significant negative relationship (ρ = −0.39, p < 0.001), indicating that as more permissions are removed, the participants' acceptance scores decline; this supports the results of the previous test.
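Spearman's ρ is simply the Pearson correlation computed on tie-averaged ranks. The following is a self-contained sketch of that computation, not the statistical package we used for the analysis:

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

A strictly decreasing relationship yields ρ = −1 regardless of how nonlinear it is, which is why ρ captures monotonicity rather than linearity.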

Special cases. Though we have shown that the acceptance scores are non-increasing as we move upward in the lattice, there are some exceptions in our data, shown by the dashed red edges in Figure 4. However, based on Wilcoxon rank-sum tests, these increases are not significant (p > 0.05).

Lattice pruning approach. As this is an exploratory study, we focused on four permissions so that we could explore all subsets of permissions (all nodes in the lattice). We could thus test whether the proposed pruning strategy can be used on the lattice instead of testing all nodes individually. Based on our questionnaire, an acceptability score of 4 can be chosen as the usability threshold, meaning that any node with an acceptance value below 4 is not acceptable and thus its ancestors can be pruned. While we have tested only one specific app, these results point to the viability of a lattice-based approach for crowd management. This particular lattice suggests that configurations that disable access to {contacts}, {location}, and {contacts, location} are promising configurations that trade off usability and privacy for Instagram. Moreover, after the first level, only the {contacts, location} configuration would have been explored by the strategy.

Rating and Acceptance Scores

In the Methodology section, we summarized the survey questions that we asked participants after they used each configuration of the test app. While in Q2 we asked about the participant's acceptance of using the app knowing that specific features were disabled, in Q3 we asked participants to rate the used configuration compared to the original app. Both questions measure, in different ways, how much participants liked the app they used. Our data indicates agreement between the two variables: the higher the rating, the higher the acceptance score. We used Spearman's Rank Correlation test to see if acceptance level (Q2) can substitute for rating (Q3) in analyses (hypothesis H2). We found a statistically significant positive relationship between the variables (ρ = 0.84, p < 0.001), and so we can accept H2.

Usability Proxies

To test our hypotheses regarding whether proxies can be used to indirectly determine the usability/acceptability of an app, we analyzed the relationships between the other variables and the acceptance score. Those variables are the time that participants reported using a specific configuration of the app and whether participants found a problem related to the disabled permissions in the app; they correspond to survey questions Q4 and Q1. Figure 6a shows the relationship between the acceptance score (y-axis) and the total usage time (x-axis), and Figure 6b shows the relationship between the acceptance score (y-axis) and whether participants found a problem or not (x-axis). We can see in the figures that participants who reported using the app as much as or more than the original app (total usage values 2 and 3) tended to give higher acceptance scores than those who reported using it less (total


Dependent          Independent      Spearman's Rank Test     Wilcoxon Rank-Sum Test
                                      ρ        p-value          W        p-value
Acceptance (Q2)    Usage (Q4)       0.47     < 0.001***        993     < 0.001***
Acceptance (Q2)    Problem (Q1)    −0.64     < 0.001***       4290     < 0.001***
Usage (Q4)         Problem (Q1)    −0.49     < 0.001***       3619     < 0.001***

Statistical significance: *** p < 0.001, ** p < 0.01, * p < 0.05

Table 2. Statistical significance results of the relationships between the other usability proxies.

usage value 1). Also, finding a problem affected the scores negatively and caused participants to give lower acceptance scores.

To test the significance of these findings, we ran both the Wilcoxon rank-sum and Spearman's Rank Coefficient tests. The results (Table 2) suggest significant relationships between the variables. The first row indicates a significant positive relationship between usage time and acceptance score, i.e., the higher the usage time, the higher the acceptance score. The second row shows a negative relationship: whenever participants encountered problems, the acceptance scores declined. Since both relationships are statistically significant (p < 0.001), we can accept both H3 and H4.

Moreover, we analyzed the relationship between usage time and finding a problem, as shown in Figure 6c. We found a significant negative relationship between them, as shown in the last row of Table 2. This indicates that whenever a problem was encountered, the usage time declined significantly.

Predicting User Ratings

We tested the collaborative filtering algorithm described earlier on 137 ratings from 26 participants who, on average, rated five configurations of the test app. We used the acceptance score, which was recorded on a 7-point Likert scale, as the usability measure. To calculate accuracy, we iteratively considered one of the participants' ratings as unknown and tried to predict it. We then compared the predicted rating with the known (ground-truth) rating, specifically calculating the mean absolute error (MAE) and root mean square error (RMSE) of the collaborative filtering algorithm.
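The two error metrics over the leave-one-out (predicted, actual) pairs are straightforward to compute; a minimal sketch:

```python
def mae_rmse(pairs):
    """MAE and RMSE over (predicted, actual) rating pairs collected
    by leave-one-out evaluation."""
    n = len(pairs)
    mae = sum(abs(p - a) for p, a in pairs) / n
    rmse = (sum((p - a) ** 2 for p, a in pairs) / n) ** 0.5
    return mae, rmse
```

RMSE penalizes large individual errors more heavily than MAE, which is why the two columns in Table 4 move together but are not proportional.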

Before presenting our results, we first provide a concrete example from our dataset. Table 3 shows ratings for one participant (the 'active user') for whom the rating for node {1,2} is being predicted. The 'Actual' column shows the participant's actual ratings for various configurations, including the rating for {1,2} in parentheses, which is treated as the ground truth. In the 'Nearest Neighbors' columns, we show the ratings of the two closest neighbors of the active user, based on the cosine similarity of their rating vectors with the active user's. Note that this metric normalizes participants' scores, and so Neighbor 1 is close to the active user despite having generally higher scores. The last column of Table 3 shows the predicted rating for node {1,2}. The predicted rating accounts for variations in the average ratings of individual users and in this case results in a prediction of 3.97. Applying the clamping technique, we set the prediction to 3 because 3.97 is greater than the rating value for the child node {1} in the lattice structure. In this example, clamping yields better results.
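A mean-offset prediction with cosine similarity and the clamping step can be sketched as follows. This follows the general GroupLens-style formulation [13] rather than our exact implementation; ratings are dictionaries mapping lattice nodes to scores, and `child_ratings` holds the active user's ratings for the target node's children:

```python
def cosine(u, v):
    """Cosine similarity over the items both users rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    nu = sum(u[i] ** 2 for i in common) ** 0.5
    nv = sum(v[i] ** 2 for i in common) ** 0.5
    return dot / (nu * nv)

def predict(active, neighbors, item, child_ratings=()):
    """Predict the active user's rating for `item` from similarity-
    weighted neighbor deviations around each neighbor's own mean,
    then clamp so the prediction never exceeds a rated child node."""
    mean_a = sum(active.values()) / len(active)
    num = den = 0.0
    for nb in neighbors:
        if item not in nb:
            continue
        sim = cosine(active, nb)
        mean_n = sum(nb.values()) / len(nb)
        num += sim * (nb[item] - mean_n)
        den += abs(sim)
    pred = mean_a + (num / den if den else 0.0)
    if child_ratings:
        # Clamping: removing strictly more permissions should not be
        # rated as more acceptable than removing a subset of them.
        pred = min(pred, min(child_ratings))
    return pred
```

Working around each user's mean is what lets a consistently generous rater (like Neighbor 1 in Table 3) still count as a close neighbor.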

Node        Nearest Neighbors          Active User
            Neighbor 1   Neighbor 2   Actual    Prediction
{1}             7            3          3
{2}             6
{4}                          1
{1,2}           6            4        ? (2)         3
{1,3}                        3          3
{2,4}           2                       2
{3,4}           2
{1,3,4}                      2
{2,3,4}                      1

Table 3. Example of the prediction algorithm.

# of Nearest        Actual               Clamping
Neighbors       MAE       RMSE       MAE       RMSE
1              1.402     2.098      1.312     1.974
2              1.375     2.072      1.301     1.987
3              1.317     1.968      1.250     1.896
4              1.301     1.925      1.236     1.845
5              1.285     1.894      1.222     1.820
6              1.273     1.872      1.207     1.794

Table 4. Accuracy of the prediction algorithm.

The accuracy of our prediction is shown in Table 4. Based on our data set, we can predict the ratings using up to six nearest neighbors; for more than six neighbors the accuracy does not improve significantly. We can see that the MAE decreases as the number of neighbors increases. We anticipate that with a larger data set we could increase the number of nearest neighbors, which may result in better MAE and RMSE.

We compared the accuracy of our predictions with the grand average (mean of all ratings; MAE = 1.63, RMSE = 2.13) and item averages (mean of each item's ratings; MAE = 1.397, RMSE = 1.925). Our algorithm with the clamping technique performs better than both baselines. Thus we conclude that collaborative filtering presents a viable approach for predicting the acceptability of different permission sets for this app, and is suitable for recommending acceptable configurations to users for various privacy levels (e.g., predicting scores for this app with multiple permissions removed).

DISCUSSION AND LIMITATIONS

We now discuss the limitations of our study and directions for future work.

Sample size. Our study's size (26 participants) was adequate to test various hypotheses with statistical significance but precluded others. For example, we wanted to study whether privacy attitudes affected usability scores and could be used to improve recommendations. We hope to study such effects in a larger study, possibly by recruiting participants online.

Population. We leveraged a local population since our experiment involved an in-person visit to our lab for installation of the ratings app and training, which we thought was prudent to ensure understanding of our compensation scheme. As a result, all participants but one were university students. While our results do not extend to the general population, only 25% of our participants were technical majors, and thus our sample was diverse in this sense. Future work may investigate if age is an important factor and, if so, examine crowdsourced exploration based on age groups.

Figure 6. Scatter plots showing the relationships between the other usability proxies: a) total usage time and acceptance scores, b) finding a problem in the app and acceptance scores, and c) finding a problem and total usage time.

Scalable exploration of different apps. Our study involved only one app (Instagram), which limits the generalizability of our findings. In the future we would like to compare results across various classes of apps, such as social media, games, and utilities. While we expect the general pruning strategy to work (based on H1), the lattice may be very large for some apps. To improve scalability, we may monitor which permissions are utilized by the user under normal operation (for example, using the approach by Felt et al. [4]), and then use our crowdsourced approach for subsets relevant to those permissions. Another possibility is to structure the exploration of an app based on knowledge gained from the exploration of another app. If similar users have already been identified, these users can be tasked to explore more diverse configurations, as common configurations are expected to add less information.

Installation process. On Android, users can view the permissions requested by an app during installation. Whether a participant read these permissions was unknown to us, possibly introducing a confound in which not all participants were aware of the permission set and so were biased differently in their assessments. We therefore ordered the survey questions such that participants were first asked to report any problems encountered in the app (Q1) and only then informed of the removed permissions and asked to rate the app (Q2). Participants thus reflected upon their actual usage of the app immediately before rating it. Our analysis shows a strong correlation between the problems participants encountered with configurations and their ratings for those configurations, indicating that their ratings were based largely on their experience.

Incentivizing the crowd. We envision our approach integrated into popular app marketplaces, where app ratings can be enhanced to show the top recommended configurations for users. Though recruiting and motivating crowd participation to rate apps with some permissions disabled is not the focus of this paper, we expect that the reciprocal benefits that participation might offer could be a basis for doing so. Popular app marketplaces already demonstrate that users are motivated to rate apps for such reciprocal benefits.

Failure modes. Apps can behave in various ways when permissions are selectively disabled, ranging from crashing to gracefully handling errors. Our approach is primarily relevant to apps that gracefully handle any resulting errors. However, we note that combinations of permissions that render an app useless will be quickly identified by a handful of users and 'pruned' from further exploration. Thus, our approach need not inconvenience a large number of users, even for apps that do not handle permission removals gracefully.

Accuracy of the recommendations. Some users may experience problems unrelated to the removed permissions. In our study we observed six such data points; removing them would have improved the MAE of our recommendations by 1.5%. In a real deployment, errors encountered by users could be analyzed for relevance to the particular configuration (e.g., by observing software exceptions). Moreover, as with any other recommendation system, malicious users could provide fake ratings in favor of or against specific configurations. For example, Mukherjee et al. [11] analyzed and proposed models for detecting fake product reviews. Although detecting such ratings is out of the scope of this paper, removing those ratings would increase the accuracy of the recommendations.

CONCLUSION

Through a preliminary user study, we have shown that it is feasible to apply crowdsourcing as a technique to collectively explore various security configurations of apps and to find configurations that make suitable tradeoffs between privacy and usability for a diverse set of users. While others have proposed complementary techniques that rely on automated analyses or seek opinions of permission settings from the crowd, we posit that considering opinions based on actual usage of various configurations would facilitate better recommendations to users. However, as the space of all possible permission settings grows exponentially with the number of permissions, it is imperative to follow a thoughtful crowd-management strategy; a goal of this work was to validate our lattice-based strategy for crowd management. Given our promising results, we thus advocate for further, and larger scale, research on crowdsourced exploration of security configurations.

ACKNOWLEDGMENTS

This work was supported by grants 1228364 and 1228471 from the NSF. Ismail is funded by the College of Computer and Information Sciences at King Saud University, Saudi Arabia. We thank Steven Armes for earlier work on our software; Shirin Nilizadeh for testing the software; Wesley Beaulieu and Hiroki Naganobori for assistance with statistical analysis; and Norman Su and Kelly Caine for valuable feedback.

REFERENCES

1. Amini, S., Lin, J., Hong, J. I., Lindqvist, J., and Zhang, J. Mobile application evaluation using automation and crowdsourcing. In Workshop on Privacy Enhancing Tools (July 2013).

2. Bugiel, S., Davi, L., Dmitrienko, A., Fischer, T., Sadeghi, A.-R., and Shastry, B. Towards taming privilege-escalation attacks on Android. In 19th Network & Distributed System Security Symposium (Feb. 2012).

3. Ekstrand, M. D., Riedl, J. T., and Konstan, J. A. Collaborative filtering recommender systems. Foundations and Trends in Human-Computer Interaction 4, 2 (Feb. 2011), 81–173.

4. Felt, A. P., Chin, E., Hanna, S., Song, D., and Wagner, D. Android permissions demystified. In 18th ACM Conference on Computer and Communications Security (2011), 627–638.

5. Felt, A. P., Greenwood, K., and Wagner, D. The effectiveness of application permissions. In 2nd USENIX Conference on Web Application Development (2011), 75–86.

6. Felt, A. P., Ha, E., Egelman, S., Haney, A., Chin, E., and Wagner, D. Android permissions: User attention, comprehension, and behavior. In 8th Symposium on Usable Privacy and Security (2012), 3:1–3:14.

7. Kelley, P. G., Consolvo, S., Cranor, L. F., Jung, J., Sadeh, N., and Wetherall, D. A conundrum of permissions: Installing applications on an Android smartphone. In Financial Cryptography and Data Security. Springer, 2012, 68–79.

8. Kelley, P. G., Cranor, L. F., and Sadeh, N. Privacy as part of the app decision-making process. In 2013 ACM Conference on Human Factors in Computing Systems (2013), 3393–3402.

9. Kennedy, K., Gustafson, E., and Chen, H. Quantifying the effects of removing permissions from Android applications. In IEEE Mobile Security Technologies (2013).

10. Lin, J., Sadeh, N., Amini, S., Lindqvist, J., Hong, J. I., and Zhang, J. Expectation and purpose: Understanding users' mental models of mobile app privacy through crowdsourcing. In 2012 ACM Conference on Ubiquitous Computing (2012), 501–510.

11. Mukherjee, A., Kumar, A., Liu, B., Wang, J., Hsu, M., Castellanos, M., and Ghosh, R. Spotting opinion spammers using behavioral footprints. In 19th ACM Conference on Knowledge Discovery and Data Mining (2013), 632–640.

12. Nauman, M., Khan, S., and Zhang, X. Apex: Extending Android permission model and enforcement with user-defined runtime constraints. In 5th ACM Symposium on Information, Computer and Communications Security (2010), 328–332.

13. Resnick, P., Iacovou, N., Suchak, M., Bergstrom, P., and Riedl, J. GroupLens: An open architecture for collaborative filtering of netnews. In ACM Conference on Computer Supported Cooperative Work (1994), 175–186.

14. Saltzer, J. H., and Schroeder, M. D. The protection of information in computer systems. Proceedings of the IEEE 63, 9 (Sept. 1975).

15. Sarwar, B., Karypis, G., Konstan, J., and Riedl, J. Analysis of recommendation algorithms for e-commerce. In 2nd ACM Conference on Electronic Commerce (2000), 158–167.

16. Sarwar, B., Karypis, G., Konstan, J., and Riedl, J. Item-based collaborative filtering recommendation algorithms. In 10th International Conference on World Wide Web (2001), 285–295.

APPENDIX

Here we list the survey questions in the ratings app. In these questions, we referred to a configuration of the test app with some permissions removed as a "version" of the test app, as we feared that "configuration" might confuse participants.

Q1. Were there any features that did not appear to work in this version of the test app?
  • Yes  • No
If (Yes), please provide examples of features that did not work (select all that apply):
  • Location based features  • Camera based features  • Microphone based features  • The app could not access my contacts  • Others: [ ]

Q2. The following features were removed from the test app for this version: • Camera • Location
Would you find it acceptable to use the test app with these features disabled until the next major app update from the test app?
  1) Totally unacceptable  2) Unacceptable  3) Slightly unacceptable  4) Neutral  5) Slightly acceptable  6) Acceptable  7) Perfectly acceptable

Q3. How would you rate this version of the test app compared to the original one?
  1) Much worse  2) Somewhat worse  3) About the same  4) Somewhat better  5) Much better

Q4. How often did you use this version of the test app since you installed it on INSTALLATION DATE?
  a) Not at all  b) Less than I normally use the test app  c) About as much as I normally use the test app  d) More than I normally use the test app