Hippocratic Data Management Rakesh Agrawal IBM Almaden Research Center.
Thesis
We need information systems that
– respect the privacy of the data they manage, AND
– do not impede the useful flow of information.
It is feasible to reconcile the apparent contradiction.
Outline
– Why Privacy in Data Systems
– Some Technology Directions
– Some Challenging Problems
Drivers for Privacy
Privacy surveys:
– 17% privacy fundamentalists, 56% pragmatic majority, 27% marginally concerned (Understanding net users' attitude about online privacy, April 99)
– 83% would stop doing business with a company if it misused customer information (Privacy on and off the Internet: What consumers want, Nov. 2001)
Govt. legislation & guidelines:
– Fair Information Practices Act (US, 1974)
– OECD Guidelines (Europe, 1980)
– Canadian Standards Association’s Model Code (1995)
– Australian Privacy Amendment (2000)
– Japan: proposed legislation (2003)
– HIPAA, GLB, recent U.S. federal & state initiatives
Privacy Violations
Accidents:
– Kaiser, GlobalHealthtrax
Lax security:
– Massachusetts govt.
Ethically questionable behavior:
– Lotus & Equifax, Lexis-Nexis, Medical Marketing Service, Boston University, CVS & Giant Food
Illegal:
– Toysmart
Assertion
Enterprises lack tools and technologies for managing private data and enforcing privacy policies.
Founding Tenets of Current Database Systems
Ullman, “Principles of Database and Knowledgebase Systems”
Fundamental:
– Manage persistent data.
– Access a large amount of data efficiently.
Desirable:
– Support for data model, high-level languages, transaction management, access control, and resiliency.
Similar list in other database textbooks.
Statistical & Secure Databases
Statistical Databases
– Provide statistical information (sum, count, etc.) without compromising sensitive information about individuals [AW89]
Multilevel Secure Databases
– Multilevel relations, e.g., records tagged “secret”, “confidential”, or “unclassified” [JS91]
Need to protect privacy in transactional databases that support daily operations.
– Cannot restrict queries to statistical queries.
– Cannot tag all the records “top secret”.
Our Research Directions
– Privacy Preserving Data Mining
– Hippocratic Databases
Data Mining and Privacy
The primary task in data mining: development of models about aggregated data.
Can we develop accurate models without access to precise information in individual data records?
R. Agrawal, R. Srikant. Privacy Preserving Data Mining. ACM Int’l Conf. on Management of Data (SIGMOD), May 2000.
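Privacy Preserving Data Mining
[Figure: individual records (e.g., 30 | 25K | …, 50 | 40K | …) each pass through a Randomizer, producing perturbed records (e.g., 65 | 50K | …, 35 | 60K | …). The age distribution and the salary distribution are reconstructed from the perturbed records and fed to the data mining algorithm, which produces the model.]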
Reconstruction Problem
Original values x1, x2, ..., xn
– from probability distribution X
To hide these values, we use y1, y2, ..., yn
– from probability distribution Y
Given
– x1+y1, x2+y2, ..., xn+yn
– the probability distribution of Y
Estimate the probability distribution of X.
Intuition (Reconstruct single point)
Use Bayes' rule for density functions.
[Figure: original distribution for Age (axis from 10 to 90), with the probabilistic estimate of the original value of a randomized point V.]
Reconstruction: Intuition
Combine estimates of where a point came from for all the points:
– yields estimate of original distribution.
[Figure: estimated distribution over Age (axis from 10 to 90).]
Reconstruction Algorithm
$f_X^0$ := Uniform distribution
j := 0
repeat
    $f_X^{j+1}(a) := \frac{1}{n} \sum_{i=1}^{n} \frac{f_Y((x_i + y_i) - a)\, f_X^j(a)}{\int_{-\infty}^{\infty} f_Y((x_i + y_i) - z)\, f_X^j(z)\, dz}$   (Bayes’ rule)
    j := j+1
until (stopping criterion met)
Converges to maximum likelihood estimate.
– D. Agrawal & C.C. Aggarwal, PODS 2001.
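The iterative procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes Gaussian noise, discretizes the domain of X into ten bins, replaces the integral by a sum over bin midpoints, and uses a fixed iteration count as the stopping criterion. The names `gaussian_pdf`, `reconstruct`, and the synthetic ages are all hypothetical.

```python
import math
import random


def gaussian_pdf(z, sigma):
    """Density of the (assumed) Gaussian noise distribution Y at z."""
    return math.exp(-z * z / (2 * sigma * sigma)) / (sigma * math.sqrt(2 * math.pi))


def reconstruct(w, midpoints, noise_pdf, iterations=30):
    """Estimate bin probabilities of X from randomized values w_i = x_i + y_i."""
    k = len(midpoints)
    f = [1.0 / k] * k                       # f_X^0 := uniform distribution
    for _ in range(iterations):             # stopping criterion: fixed number of passes
        new_f = [0.0] * k
        used = 0
        for wi in w:
            # Bayes' rule: posterior weight of each bin for this randomized value
            weights = [noise_pdf(wi - a) * fa for a, fa in zip(midpoints, f)]
            total = sum(weights)
            if total == 0.0:                # value too far from every bin; skip it
                continue
            used += 1
            for b in range(k):
                new_f[b] += weights[b] / total
        if used == 0:
            break
        f = [v / used for v in new_f]       # average the posteriors over all points
    return f


# Hypothetical data: two age clusters, perturbed with Gaussian noise (sigma = 10).
rng = random.Random(0)
sigma = 10.0
ages = [rng.gauss(25, 2) for _ in range(500)] + [rng.gauss(75, 2) for _ in range(500)]
randomized = [x + rng.gauss(0, sigma) for x in ages]
midpoints = [5 + 10 * b for b in range(10)]  # bins of width 10 covering 0 to 100
estimate = reconstruct(randomized, midpoints, lambda z: gaussian_pdf(z, sigma))
```

Only the randomized values and the noise density are passed to `reconstruct`; the original ages are used solely to generate the test input.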
Works Well
[Figure: number of people (0 to 1200) versus age (20 to 60) for the Original, Randomized, and Reconstructed distributions.]
Classification
Naïve Bayes
– Assumes independence between attributes.
Decision Tree
– Correlations are weakened by randomization.
Experimental Methodology
Compare accuracy against
– Original: unperturbed data without randomization.
– Randomized: perturbed data but without making any corrections for randomization.
Test data not randomized.
Synthetic data benchmark from [AGI+92].
Training set of 100,000 records, split equally between the two classes.
Decision Tree Experiments
[Figure: accuracy (50 to 100) on functions Fn 1 to Fn 5 for Original, Randomized, and Reconstructed data, at 100% randomization level.]
Accuracy vs. Randomization
[Figure: accuracy (40 to 100) versus randomization level (10, 20, 40, 60, 80, 100, 150, 200) on Fn 3 for Original, Randomized, and Reconstructed data.]
So far…
Question: Can we develop accurate models without access to precise information in individual data records?
Answer: yes, by randomization.
– for numerical attributes, classification
How about Association Rules?
Associations Recap
A transaction t is a set of items (e.g. books).
All transactions form a set T of transactions.
Any itemset A has support s in T if
    $s = \mathrm{supp}(A) = \frac{\#\{t \in T \mid A \subseteq t\}}{|T|}$
Itemset A is frequent if $s \ge s_{min}$.
Task: Find all frequent itemsets.
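As a small illustration of the support definition above, the following Python sketch computes supp(A) directly and enumerates frequent itemsets of a given size by brute force. The function names and the toy transactions are hypothetical, and the exhaustive enumeration is chosen for clarity; it is not an efficient mining algorithm.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions t in T with A a subset of t."""
    a = frozenset(itemset)
    hits = sum(1 for t in transactions if a <= set(t))
    return hits / len(transactions)


def frequent_itemsets(k, transactions, s_min):
    """All itemsets of size k whose support is at least s_min (naive enumeration)."""
    items = sorted(set().union(*transactions))
    return [set(c) for c in combinations(items, k)
            if support(c, transactions) >= s_min]


# Toy data: four transactions over the items a, b, c.
T = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b"}]
pairs = frequent_itemsets(2, T, s_min=0.5)
```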
The Problem
How to randomize transactions so that
– we can find frequent itemsets
– while preserving privacy at transaction level?
A. Evfimievski, R. Srikant, R. Agrawal, J. Gehrke. Mining Association Rules Over Privacy Preserving Data. 8th Int'l Conf. on Knowledge Discovery in Databases and Data Mining, July 2002.
Randomization Overview
[Figure: users Alice, Bob, and Chris hold transactions such as “J.S. Bach, painting, nasa.gov, …”, “B. Spears, baseball, cnn.com, …”, and “B. Marley, camping, linux.org, …”. Each transaction is randomized before it is sent to the Recommendation Service (e.g., “Metallica, painting, nasa.gov, …”, “B. Spears, soccer, bbc.co.uk, …”, “B. Marley, camping, ibm.com, …”). The service applies support recovery to the randomized transactions to find associations and returns recommendations.]
Uniform Randomization
Given a transaction,
– keep item with, say, 20% probability,
– replace with a new random item with 80% probability.
Example: {x, y, z}
10 M transactions of size 10 with 10 K items:
– 1% have {x, y, z}: × 0.2^3 gives 0.008% of all transactions (800 transactions), i.e. 97.8% of the randomized transactions that contain {x, y, z}
– 5% have {x, y}, {x, z}, or {y, z} only: × 0.2^2 × 8/10,000 gives 0.00016% (16 transactions), i.e. 1.9%
– 94% have one or zero items of {x, y, z}: × at most 0.2 × (9/10,000)^2 gives less than 0.00002% (2 transactions), i.e. 0.3%
Privacy Breach: Given {x, y, z} in the randomized transaction, we have about 98% certainty of {x, y, z} in the original one.
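The arithmetic behind the example can be checked with a short Python calculation. It simply multiplies the fractions and probability factors stated on the slide; the variable names are hypothetical, and the third group uses the slide's upper bound rather than an exact probability.

```python
total = 10_000_000      # 10 M transactions
keep = 0.2              # probability that an original item is kept
n_items = 10_000        # 10 K items

# (fraction of original transactions, probability that {x, y, z}
#  appears in the randomized transaction), as given in the example
groups = {
    "all three": (0.01, keep ** 3),
    "exactly two": (0.05, keep ** 2 * 8 / n_items),
    "one or zero": (0.94, keep * (9 / n_items) ** 2),   # upper bound
}

# Expected number of randomized transactions containing {x, y, z}, per group
counts = {name: frac * prob * total for name, (frac, prob) in groups.items()}

# Share of those transactions whose original really contained {x, y, z}
posterior = counts["all three"] / sum(counts.values())
```

The resulting `posterior` is the "about 98% certainty" quoted in the privacy breach statement.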
Privacy Breach
Suppose:
– t is an original transaction;
– t’ is the corresponding randomized transaction;
– A is a (frequent) itemset.
Definition: Itemset A causes a privacy breach of level ρ if, for some item z ∈ A,
    $\Pr[z \in t \mid A \subseteq t'] \ge \rho$
Our Solution
Insert many false items into each transaction.
Hide true itemsets among false ones.
“Where does a wise man hide a leaf? In the forest. But what does he do if there is no forest?”
“He grows a forest to hide it in.”
G.K. Chesterton
Can we still find frequent itemsets while having sufficient privacy?
Cut and Paste Randomization
Given transaction t of size m, construct t’:
– Choose a number j between 0 and $K_m$ (cutoff);
– Include j items of t into t’;
– Each other item is included into t’ with probability $p_m$.
The choice of $K_m$ and $p_m$ is based on the desired level of privacy.
Example (j = 4):
t = a, b, c, u, v, w, x, y, z
t’ = b, v, x, z, œ, å, ß, ξ, ψ, €, א, ъ, ђ, …
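The three steps above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions: j is drawn uniformly from 0..K_m and capped at the transaction size, the j kept items are chosen uniformly at random, and `universe` is a hypothetical list of all possible items. The function name and parameters are not from the slides.

```python
import random


def cut_and_paste(t, universe, k_m, p_m, rng=random):
    """Return a randomized transaction t' built from the original transaction t."""
    t = list(t)
    # Step 1: choose the cutoff j (assumed uniform, capped at |t|)
    j = min(rng.randint(0, k_m), len(t))
    # Step 2: include j randomly chosen items of t
    kept = set(rng.sample(t, j))
    randomized = set(kept)
    # Step 3: every other item (the rest of t and items outside t)
    # is included independently with probability p_m
    for item in universe:
        if item not in kept and rng.random() < p_m:
            randomized.add(item)
    return randomized


# Hypothetical usage: a 9-item transaction over a universe of 109 items.
rng = random.Random(1)
t = list("abcuvwxyz")
universe = t + [f"item{n}" for n in range(100)]
t_prime = cut_and_paste(t, universe, k_m=4, p_m=0.05, rng=rng)
```

Passing a seeded `random.Random` instance keeps runs repeatable; a real deployment would need a source of randomness that the server cannot predict.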
Partial Supports
To recover original support of an itemset, we need randomized supports of its subsets.
Given an itemset A of size k and transaction size m, a vector of partial supports of A is
    $\vec{s} = (s_0, s_1, \ldots, s_k)$, where $s_l = \frac{1}{|T|} \cdot \#\{t \in T \mid \#(t \cap A) = l\}$
– Here $s_k$ is the same as the support of A.
– Randomized partial supports are denoted by $\vec{s}\,'$.
Transition Matrix
Let k = |A|, m = |t|.
Transition matrix P = P(k, m) connects randomized partial supports with original ones:
    $\mathrm{E}\,\vec{s}\,' = P \cdot \vec{s}$, where $P_{l',l} = \Pr[\#(t' \cap A) = l' \mid \#(t \cap A) = l]$
Randomized supports are distributed as a sum of multinomial distributions.
The Unbiased Estimators
Given randomized partial supports, we can estimate original partial supports:
    $\vec{s}_{est} = Q \cdot \vec{s}\,'$, where $Q = P^{-1}$
Covariance matrix for this estimator:
    $\mathrm{Cov}\,\vec{s}_{est} = \frac{1}{|T|} \sum_{l=0}^{k} s_l \cdot Q\, D[l]\, Q^{T}$, where $D[l]_{i,j} = P_{i,l}\,\delta_{i=j} - P_{i,l}\,P_{j,l}$
To estimate it, substitute $s_l$ with $(s_{est})_l$.
– Special case: estimators for support and its variance
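For a single item (k = 1) the estimator reduces to inverting a 2×2 matrix, which makes a compact worked example. The sketch below assumes a simplified per-item randomization (a present item is kept with probability `p_keep`, an absent item is inserted with probability `p_insert`) rather than the cut-and-paste operator; these probabilities and all names are hypothetical.

```python
def invert_2x2(m):
    """Inverse of a 2x2 matrix given as [[a, b], [c, d]]."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def mat_vec(m, v):
    """Matrix-vector product."""
    return [sum(x * y for x, y in zip(row, v)) for row in m]


p_keep, p_insert = 0.6, 0.1          # hypothetical randomization parameters

# P[l'][l] = Pr[#(t' ∩ A) = l' | #(t ∩ A) = l] for k = 1; columns sum to 1
P = [[1 - p_insert, 1 - p_keep],
     [p_insert,     p_keep]]
Q = invert_2x2(P)                    # Q = P^{-1}

s_true = [0.9, 0.1]                  # original partial supports (s_0, s_1)
s_rand = mat_vec(P, s_true)          # expected randomized partial supports
s_est = mat_vec(Q, s_rand)           # estimate recovered from randomized supports
```

Here the expected randomized supports are used as input, so the estimate matches the original exactly; with observed supports from a finite dataset the estimate would fluctuate around it, which is what the covariance formula quantifies.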
Privacy Breach Analysis
How many added items are enough to protect privacy?
– Have to satisfy $\Pr[z \in t \mid A \subseteq t'] < \rho$ (no privacy breaches)
– Select parameters so that it holds for all itemsets.
– Use the formula, with $s_l^{+} = \Pr[\#(t \cap A) = l,\ z \in t]$ and $s_0^{+} = 0$:
    $\Pr[z \in t \mid A \subseteq t'] = \frac{\sum_{l=0}^{k} s_l^{+} \cdot P_{k,l}}{\sum_{l=0}^{k} s_l \cdot P_{k,l}}$
Parameters are to be selected in advance!
– Construct a privacy-challenging test: an itemset all of whose subsets have maximum possible support.
– Enough to know maximal support of an itemset for each size.
Lowest Discoverable Support
LDS is the support s.t., when predicted, it is 4σ away from zero.
Roughly, LDS is proportional to $1/\sqrt{|T|}$.
[Figure: LDS vs. number of transactions. LDS, % (0 to 1.2) versus number of transactions, millions (1 to 100), for 1-itemsets, 2-itemsets, and 3-itemsets; |t| = 5, ρ = 50%.]
LDS vs. Breach Level
[Figure: LDS, % (0 to 2.5) versus privacy breach level, % (30 to 90), for 1-itemsets, 2-itemsets, and 3-itemsets; |t| = 5, |T| = 5 M.]
Reminder: breach level is the limit on $\Pr[z \in t \mid A \subseteq t']$.
Real Datasets: soccer, mailorder
Soccer is the clickstream log of WorldCup’98 web site, split into sessions of HTML requests.
– 11 K items (HTMLs), 6.5 M transactions
Mailorder is a purchase dataset from a certain on-line store.
– Products are replaced with their categories
– 96 items (categories), 2.9 M transactions
Results
Breach level = 50%.
Soccer: smin = 0.2% (0.07% for 3-itemsets)
Itemset Size | True Itemsets | True Positives | False Drops | False Positives
1            | 266           | 254            | 12          | 31
2            | 217           | 195            | 22          | 45
3            | 48            | 43             | 5           | 26
Mailorder: smin = 0.2% (0.05% for 3-itemsets)
Itemset Size | True Itemsets | True Positives | False Drops | False Positives
1            | 65            | 65             | 0           | 0
2            | 228           | 212            | 16          | 28
3            | 22            | 18             | 4           | 5
Summary
Can have our cake and mine it too!
Randomization is an interesting approach for building data mining models while preserving user privacy!!!
Y. Lindell, B. Pinkas. Privacy Preserving Data Mining. Crypto 2000.
S. Rizvi, J. Haritsa. Privacy-Preserving Association Rule Mining. VLDB 2002.
J. Vaidya, C.W. Clifton. Privacy Preserving Association Rule Mining in Vertically Partitioned Data. KDD 2002.
The Hippocratic Oath
“What I may see or hear in the course of treatment or even outside of the treatment in regard to the life of men, which on no account [ought to be] spread abroad, I will keep to myself, holding such things shameful to be spoken about.”
– Hippocratic Oath, 8 (circa 400 BC)
Hippocratic Databases
Founding tenet: Responsibility for the privacy of data they manage.
R. Agrawal, J. Kiernan, R. Srikant, Y. Xu. Hippocratic Databases. 28th Int'l Conf. on Very Large Databases (VLDB), August 2002.
Approach
Derive founding principles from current privacy legislation.
Strawman Design
Ten Principles of Hippocratic Databases
Collection Group
– Purpose Specification, Consent, Limited Collection
Use Group
– Limited Use, Limited Disclosure, Limited Retention, Accuracy
Security & Openness Group
– Safety, Openness, Compliance
Collection Group
1. Purpose Specification
– For personal information stored in the database, the purposes for which the information has been collected shall be associated with that information.
2. Consent
– The purposes associated with personal information shall have consent of the donor (person whose information is being stored).
3. Limited Collection
– The information collected shall be limited to the minimum necessary for accomplishing the specified purposes.
Use Group
4. Limited Use
– The database shall run only those queries that are consistent with the purposes for which the information has been collected.
5. Limited Disclosure
– Personal information shall not be communicated outside the database for purposes other than those for which there is consent from the donor of the information.
Use Group (2)
6. Limited Retention
– Personal information shall be retained only as long as necessary for the fulfillment of the purposes for which it has been collected.
7. Accuracy
– Personal information stored in the database shall be accurate and up-to-date.
Security & Openness Group
8. Safety
– Personal information shall be protected by security safeguards against theft and other misappropriations.
9. Openness
– A donor shall be able to access all information about the donor stored in the database.
10. Compliance
– A donor shall be able to verify compliance with the above principles. Similarly, the database shall be able to address a challenge concerning compliance.
Strawman Architecture
[Figure: four component groups (Privacy Policy, Data Collection, Queries, Other) around the Store.]
Architecture: Policy
[Figure: Privacy Policy feeds the Privacy Metadata Creator, which writes Privacy Metadata to the Store.]
Privacy Metadata Creator: converts privacy policy into privacy metadata tables.
For each purpose & piece of information (attribute):
• External recipients
• Retention period
• Authorized users
Different designs possible.
Principles: Limited Disclosure, Limited Retention.
Privacy Policies Table
Purpose         | Table    | Attribute | External-recipients     | Authorized-users   | Retention
purchase        | customer | name      | {delivery, credit-card} | {shipping, charge} | 1 month
purchase        | customer | email     | empty                   | {shipping}         | 1 month
register        | customer | name      | empty                   | {registration}     | 3 years
register        | customer | email     | empty                   | {registration}     | 3 years
recommendations | order    | book      | empty                   | {mining}           | 10 years
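One way to picture the privacy policies table in code is as a list of typed rows with a lookup keyed on purpose, table, and attribute. The Python sketch below mirrors the five rows shown; the `PolicyRow` type, the `authorized_users` helper, and the choice to return an empty set for an unlisted combination are assumptions made for illustration, not part of the strawman design.

```python
from typing import NamedTuple


class PolicyRow(NamedTuple):
    purpose: str
    table: str
    attribute: str
    external_recipients: frozenset
    authorized_users: frozenset
    retention: str


POLICIES = [
    PolicyRow("purchase", "customer", "name",
              frozenset({"delivery", "credit-card"}),
              frozenset({"shipping", "charge"}), "1 month"),
    PolicyRow("purchase", "customer", "email",
              frozenset(), frozenset({"shipping"}), "1 month"),
    PolicyRow("register", "customer", "name",
              frozenset(), frozenset({"registration"}), "3 years"),
    PolicyRow("register", "customer", "email",
              frozenset(), frozenset({"registration"}), "3 years"),
    PolicyRow("recommendations", "order", "book",
              frozenset(), frozenset({"mining"}), "10 years"),
]


def authorized_users(purpose, table, attribute):
    """Users allowed to access an attribute for a purpose (empty set if unlisted)."""
    for row in POLICIES:
        if (row.purpose, row.table, row.attribute) == (purpose, table, attribute):
            return row.authorized_users
    return frozenset()
```

Treating an unlisted combination as "no one is authorized" is a deny-by-default choice; a different design could raise an error instead.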
Architecture: Data Collection
[Figure: Data Collection passes through the Privacy Constraint Validator, the Data Accuracy Analyzer, and Record Access Control into the Store, consulting Privacy Metadata; Audit Info is written to the Audit Trail.]
Privacy Constraint Validator: is the privacy policy compatible with the user’s privacy preference?
Audit Trail: audit trail for compliance.
Data Accuracy Analyzer: data cleansing, e.g., errors in address.
Record Access Control: associate set of purposes with each record.
Principles: Consent, Compliance, Accuracy, Purpose Specification.
Architecture: Queries
[Figure: Queries pass through Attribute Access Control and Record Access Control to the Store, consulting Privacy Metadata; Audit Info is written to the Audit Trail, which feeds the Query Intrusion Detector.]
1. Telemarketing cannot issue query tagged “charge”.
2. Query tagged “telemarketing” cannot see credit card info.
3. Telemarketing query only sees records that include “telemarketing” in set of purposes.
Query Intrusion Detector: e.g., telemarketing query that asks for all phone numbers.
Audit Trail:
• Compliance
• Training data for query intrusion detector
Principles: Safety, Limited Use, Compliance.
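The three query-time checks can be illustrated with a small Python sketch. Everything here is hypothetical: the user names, the mapping from purposes to attributes, the record layout, and the function names are invented for the example, and a real system would enforce these checks inside the database rather than in application code.

```python
# Hypothetical privacy metadata
USER_PURPOSES = {"telemarketing_app": {"telemarketing"}, "billing_app": {"charge"}}
PURPOSE_ATTRIBUTES = {
    "telemarketing": {"name", "phone"},
    "charge": {"name", "credit_card"},
}
RECORDS = [
    {"name": "Alice", "purposes": {"charge", "telemarketing"}},
    {"name": "Bob", "purposes": {"charge"}},
]


def authorize_query(user, query_purpose, attributes):
    """Checks 1 and 2: the purpose tag and the requested attributes."""
    # 1. A user may only issue queries tagged with a purpose it is authorized for.
    if query_purpose not in USER_PURPOSES.get(user, set()):
        raise PermissionError(f"{user} may not issue queries tagged {query_purpose!r}")
    # 2. Attribute access control: the purpose limits which attributes are visible.
    denied = set(attributes) - PURPOSE_ATTRIBUTES.get(query_purpose, set())
    if denied:
        raise PermissionError(f"purpose {query_purpose!r} cannot see {sorted(denied)}")


def visible_records(records, query_purpose):
    """Check 3: record access control based on each record's set of purposes."""
    return [r for r in records if query_purpose in r["purposes"]]


authorize_query("telemarketing_app", "telemarketing", ["name", "phone"])
names = [r["name"] for r in visible_records(RECORDS, "telemarketing")]
```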
Architecture: Other
[Figure: the Data Retention Manager, Encryption Support, and the Data Collection Analyzer operate on the Store and its Privacy Metadata.]
Data Retention Manager: delete items in accordance with privacy policy.
Encryption Support: additional security for sensitive data.
Data Collection Analyzer: analyze queries to identify unnecessary collection, retention & authorizations.
Principles: Limited Retention, Limited Collection, Safety.
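A data retention manager could be sketched as a periodic purge that compares each record's age with the retention period of its purposes. The Python sketch below is an assumption-laden illustration: the retention periods approximate the policy table (1 month as 30 days, and so on), a record is kept as long as at least one of its purposes is still within its retention period, and the record layout and function names are hypothetical.

```python
from datetime import date, timedelta

# Hypothetical retention periods per purpose (approximating the policy table)
RETENTION = {
    "purchase": timedelta(days=30),
    "register": timedelta(days=3 * 365),
    "recommendations": timedelta(days=10 * 365),
}


def still_needed(record, today, retention=RETENTION):
    """True if at least one purpose of the record is within its retention period."""
    age = today - record["collected"]
    return any(age <= retention[p] for p in record["purposes"])


def purge(records, today, retention=RETENTION):
    """Return only the records that may still be retained."""
    return [r for r in records if still_needed(r, today, retention)]


records = [
    {"id": 1, "collected": date(2003, 5, 20), "purposes": {"purchase"}},
    {"id": 2, "collected": date(2003, 1, 1), "purposes": {"purchase"}},
    {"id": 3, "collected": date(2003, 1, 1), "purposes": {"purchase", "register"}},
]
kept_ids = [r["id"] for r in purge(records, today=date(2003, 6, 1))]
```

This sketch only filters an in-memory list; it does not address copies of a record in logs or checkpoints.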
Strawman Architecture
[Figure: the complete architecture. Privacy Policy: Privacy Metadata Creator. Data Collection: Privacy Constraint Validator, Data Accuracy Analyzer. Queries: Attribute Access Control, Record Access Control, Query Intrusion Detector. Other: Data Retention Manager, Data Collection Analyzer, Encryption Support. All components operate on the Store and its Privacy Metadata; Audit Info is written to the Audit Trail.]
Status
Prototyping core functionality of the design.
Nibbling at some of the open problems (see VLDB-2002 paper).
Privacy-Preserving Synthetic Datasets for Data Mining Research
How to randomize to be able to build multiple types of models
How to handle combination of data types
How to handle rare events
[Figure: source data (credit agencies; state and local govt records; demographic; birth; marriage; communications; transactions) feeding into Synthetic Data.]
Network is the Database
What if private data never leaves a person’s data store?
– Computations travel to data
[Figure: two flows. In the first, Jane’s data is sent with a credit application and a decision comes back. In the second, an approval function is sent to Jane’s data and only the result is returned.]
Decision-Making Across Private Data Repositories
Separate databases due to statutory, competitive, or security reasons.
– Selective, minimal sharing on need-to-know basis.
Example: Among those who took a particular drug, how many had an adverse reaction and have DNA that contains a specific sequence?
– Researchers must not learn anything beyond counts.
Minimal Necessary Sharing
[Figure: R = {a, u, v, x}, S = {b, u, v, y}, R ∩ S = {u, v}.]
R ∩ S
– R must not know that S has b & y
– S must not know that R has a & x
Count (R ∩ S)
– R & S do not learn anything except that the result is 2.
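The slides do not say how Count (R ∩ S) is computed. One known approach is commutative encryption: each party encrypts its hashed values with its own secret key, the parties exchange and re-encrypt each other's values, and the doubly encrypted values match exactly when the underlying values are equal. The Python sketch below simulates both parties in one process using modular exponentiation; the prime, the hashing scheme, and all names are assumptions, and the toy parameters are for illustration only, not for real security.

```python
import hashlib
import math
import secrets

P = 2 ** 127 - 1   # a Mersenne prime; toy parameter chosen for illustration


def keygen():
    """Secret exponent coprime to P - 1, so x -> x^e mod P is a bijection."""
    while True:
        e = secrets.randbelow(P - 2) + 1
        if math.gcd(e, P - 1) == 1:
            return e


def hash_to_group(value):
    """Map a value to an integer in [1, P - 1]."""
    digest = hashlib.sha256(str(value).encode()).digest()
    return int.from_bytes(digest, "big") % (P - 1) + 1


def encrypt(key, x):
    """Commutative: encrypt(a, encrypt(b, x)) == encrypt(b, encrypt(a, x))."""
    return pow(x, key, P)


def intersection_size(r_values, s_values):
    key_r, key_s = keygen(), keygen()
    # Each party encrypts its own hashed values and sends them to the other.
    r_once = [encrypt(key_r, hash_to_group(v)) for v in r_values]
    s_once = [encrypt(key_s, hash_to_group(v)) for v in s_values]
    # Each party re-encrypts what it received; equal values now collide.
    r_twice = {encrypt(key_s, x) for x in r_once}
    s_twice = {encrypt(key_r, x) for x in s_once}
    return len(r_twice & s_twice)


count = intersection_size({"a", "u", "v", "x"}, {"b", "u", "v", "y"})
```

In a real protocol the two parties would run on separate machines and reorder the encrypted values before sending them, so that positions reveal nothing about which values matched.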
Closing Thoughts
The right to privacy: the most cherished of human freedoms
-- Warren & Brandeis, 1890
Code is law … it is all a matter of code: the software and hardware that now rule
-- L. Lessig
We can architect computing systems to protect values we believe are fundamental, or we can architect them to allow those values to disappear.
What do we want to do as computer scientists?
References
R. Agrawal, R. Srikant. Information Integration Across Autonomous Enterprises. ACM Int’l Conf. on Management of Data (SIGMOD), San Diego, California, June 2003.
R. Agrawal, J. Kiernan, R. Srikant, Y. Xu. An XPath Based Preference Language for P3P. 12th Int'l World Wide Web Conf. (WWW), Budapest, Hungary, May 2003.
R. Agrawal, J. Kiernan, R. Srikant, Y. Xu. Implementing P3P Using Database Technology. 19th Int'l Conf. on Data Engineering (ICDE), Bangalore, India, March 2003.
R. Agrawal, J. Kiernan, R. Srikant, Y. Xu. Server Centric P3P. W3C Workshop on the Future of P3P, Dulles, Virginia, Nov. 2002.
R. Agrawal, J. Kiernan, R. Srikant, Y. Xu. Hippocratic Databases. 28th Int'l Conf. on Very Large Databases (VLDB), Hong Kong, August 2002.
R. Agrawal, J. Kiernan. Watermarking Relational Databases. 28th Int'l Conf. on Very Large Databases (VLDB), Hong Kong, August 2002.
A. Evfimievski, R. Srikant, R. Agrawal, J. Gehrke. Mining Association Rules Over Privacy Preserving Data. 8th Int'l Conf. on Knowledge Discovery in Databases and Data Mining (KDD), Edmonton, Canada, July 2002.
R. Agrawal, R. Srikant. Privacy Preserving Data Mining. ACM Int’l Conf. on Management of Data (SIGMOD), Dallas, Texas, May 2000.
New Challenges
General
– Language
– Efficiency
Use
– Limited Collection
– Limited Disclosure
– Limited Retention
Security and Openness
– Safety
– Openness
– Compliance
Language
Need a language for privacy policies & user preferences.
P3P can be used as starting point.
– Developed primarily for web shopping.
– What about richer domains?
How do we balance expressibility and usability?
– Arrange concepts in hierarchy or subsumption relationship.
Purpose: [Figure: hierarchy with contact at the top, email and phone below it, and home and work at the lowest level.]
P3P recipients: [Figure: Ours, Same, Delivery, Unrelated, Public.]
Language (2)
How do we accommodate user negotiation models?
– User willing to disclose information only if fairly compensated.
– Value of privacy as coalitional game [KPR2001]
Efficiency
How do we minimize the cost of privacy checking?
How do we incorporate purpose into database design and query optimization?
Tradeoffs between space & running time.
– Only tag records in customer table with purpose, not all records. But now need to do a join when scanning records in order table.
How does the secure databases work on decomposition of multilevel relations into single-level relations [JS91] apply here?
Limited Collection
How do we identify attributes that are collected but not used?
– Assets are only needed for mortgage when salary is below some threshold.
What’s the needed granularity for numeric attributes?
– Queries only ask “Salary > threshold” for rent application.
How do we generate minimal queries?
– Redundancy may be hidden in application code.
Limited Disclosure
Can the user dynamically determine the set of recipients?
Example: Alice wants to add EasyCredit to set of recipients in EquiRate’s database.
Digital signatures.
Limited Retention
Completely forgetting some information is non-trivial.
How do we delete a record from the logs and checkpoints, without affecting recovery?
How do we continue to support historical analysis and statistical queries without incurring privacy breaches?
Safety
Encryption provides additional layer of security.
How do we index encrypted data?
How do we run queries against encrypted data?
[SWP00], [HILM02]
Openness
A donor shall be able to access all information about the donor stored in the database.
How does the database check Alice is really Alice and not somebody else?
– Princeton admissions office broke into Yale’s admissions using applicant’s social security number and birth date.
How does Alice find out what databases have information about her?
– Symmetrically private information retrieval [GIKM98].
Compliance
Universal Logging
– Can we provide each user whose data is accessed with a log of that access, along with the query reading the data?
– Use intermediaries who aggregate and analyze logs for many users.
Tracking Privacy Breaches
– Insert “fingerprint” records with emails, telephone numbers, and credit card numbers.
– Some data may be more valuable for spammers or credit card theft. How do we identify categories to do stratified fingerprinting rather than randomly inserting records?