Hippocratic Data Management Rakesh Agrawal IBM Almaden Research Center.


Page 2: Thesis

We need information systems that
– respect the privacy of data they manage, AND
– do not impede the useful flow of information.

It is feasible to reconcile the apparent contradiction.

Page 3: Outline

Why Privacy in Data Systems
Some Technology Directions
Some Challenging Problems

Page 4: Drivers for Privacy

Privacy surveys:
– 17% privacy fundamentalists, 56% pragmatic majority, 27% marginally concerned (Understanding net users' attitude about online privacy, April 99)
– 83% would stop doing business with a company if it misused customer information (Privacy on and off the Internet: What consumers want, Nov. 2001)

Govt. legislation & guidelines:
– Fair Information Practices Act (US, 1974)
– OECD Guidelines (Europe, 1980)
– Canadian Standards Association’s Model Code (1995)
– Australian Privacy Amendment (2000)
– Japan: proposed legislation (2003)
– HIPAA, GLB, Recent U.S. Federal & State Initiatives

Page 6: Privacy Violations

Accidents:
– Kaiser, GlobalHealthtrax

Lax security:
– Massachusetts govt.

Ethically questionable behavior:
– Lotus & Equifax, Lexis-Nexis, Medical Marketing Service, Boston University, CVS & Giant Food

Illegal:
– Toysmart

Page 7: Assertion

Enterprises lack tools and technologies for managing private data and enforcing privacy policies.

Page 8: Founding Tenets of Current Database Systems

Ullman, “Principles of Database and Knowledgebase Systems”

Fundamental:
– Manage persistent data.
– Access a large amount of data efficiently.

Desirable:
– Support for data model, high-level languages, transaction management, access control, and resiliency.

Similar list in other database textbooks.

Page 9: Statistical & Secure Databases

Statistical Databases
– Provide statistical information (sum, count, etc.) without compromising sensitive information about individuals [AW89].

Multilevel Secure Databases
– Multilevel relations, e.g., records tagged “secret”, “confidential”, or “unclassified” [JS91].

Need to protect privacy in transactional databases that support daily operations.
– Cannot restrict queries to statistical queries.
– Cannot tag all the records “top secret”.

Page 10: Our Research Directions

Privacy Preserving Data Mining
Hippocratic Databases

Page 11: Data Mining and Privacy

The primary task in data mining: development of models about aggregated data.

Can we develop accurate models without access to precise information in individual data records?

R. Agrawal, R. Srikant. Privacy Preserving Data Mining. ACM Int’l Conf. on Management of Data (SIGMOD), May 2000.

Page 12: Privacy Preserving Data Mining

[Figure: original records (30 | 25K | …, 50 | 40K | …) each pass through a Randomizer, producing randomized records (65 | 50K | …, 35 | 60K | …). The Age distribution and the Salary distribution are reconstructed from the randomized records and passed to the Data Mining Algorithm, which outputs the Model.]

Page 13: Reconstruction Problem

Original values x1, x2, ..., xn
– from probability distribution X

To hide these values, we use y1, y2, ..., yn
– from probability distribution Y

Given
– x1+y1, x2+y2, ..., xn+yn
– the probability distribution of Y

Estimate the probability distribution of X.
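
The randomization step of this problem can be sketched in a few lines of Python. This is only an illustration: the uniform noise distribution, the `width` parameter, and the sample ages are assumptions, not values from the talk.

```python
import random

def randomize(values, width, rng=random):
    """Return x_i + y_i, with each y_i drawn from Uniform[-width, +width].

    The uniform noise and its width are assumptions made for this sketch.
    """
    return [x + rng.uniform(-width, width) for x in values]

ages = [30, 50, 42, 61]                      # hypothetical original values x_i
disclosed = randomize(ages, width=20, rng=random.Random(0))
```

Only `disclosed` and the noise distribution would be visible to whoever builds the model.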

Page 14: Intuition (Reconstruct single point)

Use Bayes' rule for density functions.

[Figure: Age axis from 10 to 90 with a randomized value V marked; curves labeled "Original distribution for Age" and "Probabilistic estimate of original value of V".]

Page 15: Intuition (Reconstruct single point)

Use Bayes' rule for density functions.

[Figure: Age axis from 10 to 90 with a randomized value V marked; curves labeled "Original distribution for Age" and "Probabilistic estimate of original value of V".]

Page 16: Reconstruction: Intuition

Combine estimates of where a point came from for all the points:
– yields estimate of original distribution.

[Figure: distribution over Age, axis from 10 to 90.]

Page 17: Reconstruction Algorithm

f_X^0 := Uniform distribution
j := 0
repeat
    f_X^{j+1}(a) := Bayes’ Rule (formula below)
    j := j+1
until (stopping criterion met)

$$ f_X^{j+1}(a) := \frac{1}{n} \sum_{i=1}^{n} \frac{f_Y\big((x_i + y_i) - a\big)\, f_X^{j}(a)}{\int_{-\infty}^{\infty} f_Y\big((x_i + y_i) - z\big)\, f_X^{j}(z)\, dz} $$

Converges to maximum likelihood estimate.
– D. Agrawal & C.C. Aggarwal, PODS 2001.
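
A minimal sketch of this iteration, assuming the distribution of X is discretized onto a finite grid so that the integral becomes a sum. The grid, the fixed iteration count used as stopping criterion, and the helper names are assumptions for illustration, not details from the talk.

```python
def uniform_noise_density(width):
    """Density of Y ~ Uniform[-width, +width] (an assumed noise model)."""
    def f_y(d):
        return 1.0 / (2 * width) if -width <= d <= width else 0.0
    return f_y

def reconstruct(w, f_y, grid, iterations=50):
    """Estimate Pr[X = a] for each a in grid from randomized values w_i = x_i + y_i."""
    f_x = [1.0 / len(grid)] * len(grid)              # f_X^0 := uniform
    for _ in range(iterations):                      # fixed count as stopping criterion
        new = [0.0] * len(grid)
        for w_i in w:
            # Bayes' rule: posterior weight of each grid point given w_i.
            likes = [f_y(w_i - a) * p for a, p in zip(grid, f_x)]
            total = sum(likes)
            if total == 0.0:                         # w_i not explainable on this grid
                continue
            for k, v in enumerate(likes):
                new[k] += v / total
        norm = sum(new)                              # equals n when no w_i was skipped
        f_x = [v / norm for v in new]
    return f_x
```

Dividing by `norm` rather than by n keeps the estimate a proper distribution even when some randomized values fall outside the grid; with all values covered, the two are the same.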

Page 18: Works Well

[Figure: Number of People (0 to 1200) vs. Age (20 to 60), comparing the Original, Randomized, and Reconstructed distributions.]

Page 19: Classification

Naïve Bayes
– Assumes independence between attributes.

Decision Tree
– Correlations are weakened by randomization.

Page 20: Experimental Methodology

Compare accuracy against
– Original: unperturbed data without randomization.
– Randomized: perturbed data but without making any corrections for randomization.

Test data not randomized.
Synthetic data benchmark from [AGI+92].
Training set of 100,000 records, split equally between the two classes.

Page 21: Decision Tree Experiments

[Figure: Accuracy (50 to 100) for Fn 1 through Fn 5 at 100% Randomization Level, comparing Original, Randomized, and Reconstructed.]

Page 22: Accuracy vs. Randomization

[Figure: Accuracy (40 to 100) vs. Randomization Level (10, 20, 40, 60, 80, 100, 150, 200) for Fn 3, comparing Original, Randomized, and Reconstructed.]

Page 23: So far…

Question: Can we develop accurate models without access to precise information in individual data records?

Answer: yes, by randomization.
– for numerical attributes, classification

How about Association Rules?

Page 24: Associations Recap

A transaction t is a set of items (e.g. books).
All transactions form a set T of transactions.
Any itemset A has support s in T if

$$ s = \mathrm{supp}^{T}(A) := \frac{\#\{\, t \in T \mid A \subseteq t \,\}}{|T|} $$

Itemset A is frequent if s ≥ s_min.

Task: Find all frequent itemsets.
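
The support definition can be written directly in Python. The brute-force enumeration and the toy transactions below are assumptions for illustration only; practical miners prune the search instead of testing every candidate.

```python
from itertools import combinations

def support(itemset, transactions):
    """Fraction of transactions t in T with itemset ⊆ t."""
    a = frozenset(itemset)
    return sum(1 for t in transactions if a <= t) / len(transactions)

def frequent_itemsets(transactions, s_min, max_size=3):
    """Brute-force enumeration of frequent itemsets; fine for tiny inputs only."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, max_size + 1):
        for candidate in combinations(items, size):
            s = support(candidate, transactions)
            if s >= s_min:
                result[frozenset(candidate)] = s
    return result

# Hypothetical transactions.
T = [frozenset(t) for t in (
    {"bach", "painting"},
    {"bach", "painting", "nasa"},
    {"spears", "baseball"},
    {"bach", "nasa"},
)]
```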

Page 25: The Problem

How to randomize transactions so that
– we can find frequent itemsets
– while preserving privacy at transaction level?

A. Evfimievski, R. Srikant, R. Agrawal, J. Gehrke. Mining Association Rules Over Privacy Preserving Data. 8th Int'l Conf. on Knowledge Discovery in Databases and Data Mining, July 2002.

Page 26: Randomization Overview

[Figure: Alice (J.S. Bach, painting, nasa.gov, …), Bob (B. Spears, baseball, cnn.com, …), and Chris (B. Marley, camping, linux.org, …) each send their transaction to the Recommendation Service.]

Page 27: Randomization Overview

[Figure: Alice (J.S. Bach, painting, nasa.gov, …), Bob (B. Spears, baseball, cnn.com, …), and Chris (B. Marley, camping, linux.org, …) each send their transaction to the Recommendation Service, which derives Associations and returns Recommendations.]

Page 28: Randomization Overview

[Figure: the transactions of Alice (J.S. Bach, painting, nasa.gov, …), Bob (B. Spears, baseball, cnn.com, …), and Chris (B. Marley, camping, linux.org, …) reach the Recommendation Service in randomized form: (Metallica, painting, nasa.gov, …), (B. Spears, soccer, bbc.co.uk, …), and (B. Marley, camping, ibm.com, …). The service applies Support Recovery, derives Associations, and returns Recommendations.]

Page 29: Uniform Randomization

Given a transaction,
– keep each item with, say, 20% probability,
– replace it with a new random item with 80% probability.
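
A small Python sketch of this operator. It assumes that "a new random item" means an item drawn uniformly from the whole item universe; the function and parameter names are illustrative.

```python
import random

def uniform_randomize(transaction, universe, keep_prob=0.2, rng=random):
    """Keep each item with probability keep_prob; otherwise replace it with
    an item drawn uniformly from `universe` (an assumption of this sketch)."""
    universe = list(universe)
    return [item if rng.random() < keep_prob else rng.choice(universe)
            for item in transaction]
```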

Page 30: Example: {x, y, z}

10 M transactions of size 10 with 10 K items:

Original transaction has           | Share of T | Chance of {x, y, z} after randomization | Randomized transactions with {x, y, z}  | Share of those
{x, y, z}                          | 1%         | 0.2^3                                   | 0.008% (800 transactions)               | 97.8%
{x, y}, {x, z}, or {y, z} only     | 5%         | 0.2^2 • 8/10,000                        | 0.00016% (16 transactions)              | 1.9%
one or zero items of {x, y, z}     | 94%        | at most 0.2 • (9/10,000)^2              | less than 0.00002% (2 transactions)     | 0.3%

Privacy Breach: Given {x, y, z} in the randomized transaction, we have about 98% certainty of {x, y, z} in the original one.
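
The arithmetic behind this example can be checked with a few lines of Python. The variable names are illustrative; the shares and probabilities are the ones given on the slide.

```python
n_transactions = 10_000_000
n_items = 10_000
keep = 0.2

# (share of original transactions, chance that {x, y, z} appears after randomization)
cases = {
    "all of {x, y, z}": (0.01, keep ** 3),
    "exactly two":      (0.05, keep ** 2 * 8 / n_items),
    "one or zero":      (0.94, keep * (9 / n_items) ** 2),   # upper bound
}

# Expected number of randomized transactions containing {x, y, z}, per case.
expected = {name: share * p * n_transactions for name, (share, p) in cases.items()}

# Certainty that the original transaction really contained {x, y, z}.
posterior = expected["all of {x, y, z}"] / sum(expected.values())
```

The expected counts come out at roughly 800, 16, and fewer than 2, and `posterior` is about 0.98.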

Page 31: Privacy Breach

Suppose:
– t is an original transaction;
– t’ is the corresponding randomized transaction;
– A is a (frequent) itemset.

Definition: Itemset A causes a privacy breach of level ρ if, for some item z ∈ A,

$$ \Pr[\, z \in t \mid A \subseteq t' \,] \ge \rho $$

Page 32: Our Solution

Insert many false items into each transaction.
Hide true itemsets among false ones.

“Where does a wise man hide a leaf? In the forest. But what does he do if there is no forest?”
“He grows a forest to hide it in.”
– G.K. Chesterton

Can we still find frequent itemsets while having sufficient privacy?

Page 33: Cut and Paste Randomization

Given transaction t of size m, construct t’:

t = a, b, c, u, v, w, x, y, z
t’ =

Page 34: Cut and Paste Randomization

Given transaction t of size m, construct t’:
– Choose a number j between 0 and K_m (cutoff);

t = a, b, c, u, v, w, x, y, z
j = 4
t’ =

Page 35: Cut and Paste Randomization

Given transaction t of size m, construct t’:
– Choose a number j between 0 and K_m (cutoff);
– Include j items of t into t’;

t = a, b, c, u, v, w, x, y, z
j = 4
t’ = b, v, x, z

Page 36: Cut and Paste Randomization

Given transaction t of size m, construct t’:
– Choose a number j between 0 and K_m (cutoff);
– Include j items of t into t’;
– Each other item is included into t’ with probability p_m.

The choice of K_m and p_m is based on the desired level of privacy.

t = a, b, c, u, v, w, x, y, z
j = 4
t’ = b, v, x, z, œ, å, ß, ξ, ψ, €, א, ъ, ђ, …
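
The three steps can be sketched as follows. The sketch assumes that j is drawn uniformly, that j is capped at the transaction size, and that the j items are sampled without replacement; the slides do not spell out these details, and all names are illustrative.

```python
import random

def cut_and_paste(t, universe, k_m, p_m, rng=random):
    """Randomize transaction t (a set of items) into t'."""
    t = set(t)
    # Step 1: choose the cutoff j in 0..k_m (uniformly, capped at |t|: assumptions).
    j = min(rng.randint(0, k_m), len(t))
    # Step 2: j items of t go into t'.
    kept = set(rng.sample(sorted(t), j))
    # Step 3: every other item, whether in t or not, enters t' with probability p_m.
    others = set(universe) - kept
    return kept | {item for item in others if rng.random() < p_m}
```

With `p_m = 0` the result is just a random subset of t of size at most `k_m`; raising `p_m` grows the "forest" of false items around it.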

Page 37: Partial Supports

To recover original support of an itemset, we need randomized supports of its subsets.

Given an itemset A of size k and transaction size m, a vector of partial supports of A is

$$ \vec{s} = (s_0, s_1, \ldots, s_k)^{T}, \quad \text{where } s_l = \frac{1}{|T|} \cdot \#\{\, t \in T \mid \#(t \cap A) = l \,\} $$

– Here s_k is the same as the support of A.
– Randomized partial supports are denoted by $\vec{s}\,'$.

Page 38: Transition Matrix

Let k = |A|, m = |t|.
Transition matrix P = P(k, m) connects randomized partial supports with original ones:

$$ \mathrm{E}\,\vec{s}\,' = P \cdot \vec{s}, \quad \text{where } P_{l',\,l} = \Pr[\, \#(t' \cap A) = l' \mid \#(t \cap A) = l \,] $$

Randomized supports are distributed as a sum of multinomial distributions.

Page 39: The Unbiased Estimators

Given randomized partial supports, we can estimate original partial supports:

$$ \vec{s}_{est} = Q \cdot \vec{s}\,', \quad \text{where } Q = P^{-1} $$

Covariance matrix for this estimator:

$$ \mathrm{Cov}\, \vec{s}_{est} = \frac{1}{|T|} \sum_{l=0}^{k} s_l \, Q \, D[l] \, Q^{T}, \quad \text{where } D[l]_{i,j} = P_{i,l}\, \delta_{i=j} - P_{i,l}\, P_{j,l} $$

To estimate it, substitute s_l with (s_est)_l.
– Special case: estimators for support and its variance.
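
A sketch of the support estimator for the smallest case, k = 1. Instead of forming Q explicitly it solves P · s_est = s’, which gives the same result. The transition matrix and the observed randomized supports below are made-up numbers chosen for illustration.

```python
def solve(matrix, rhs):
    """Solve matrix · x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

# Hypothetical transition matrix for k = 1:
# P[l'][l] = Pr[l' items of A in t' | l items of A in t]; each column sums to 1.
P = [[0.9, 0.3],
     [0.1, 0.7]]
s_randomized = [0.78, 0.22]        # observed (s'_0, s'_1), made-up numbers
s_est = solve(P, s_randomized)     # same as Q · s' with Q = P^-1
```

Here `s_est[1]` is the estimated support of A, since s_k is the support of A.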

Page 40: Privacy Breach Analysis

How many added items are enough to protect privacy?
– Have to satisfy Pr[z ∈ t | A ⊆ t’] < ρ (⇒ no privacy breaches).
– Select parameters so that it holds for all itemsets.
– Use the formula:

$$ \Pr[\, z \in t \mid A \subseteq t' \,] = \frac{\sum_{l=0}^{k} s_l^{+} \, P_{k,l}}{\sum_{l=0}^{k} s_l \, P_{k,l}}, \quad s_l^{+} = \Pr[\, \#(t \cap A) = l,\ z \in t \,], \quad s_0^{+} = 0 $$

Parameters are to be selected in advance!
– Construct a privacy-challenging test: an itemset all of whose subsets have maximum possible support.
– Enough to know maximal support of an itemset for each size.

Page 41: Lowest Discoverable Support

LDS is the support s.t., when predicted, it is 4σ away from zero.
Roughly, LDS is proportional to $1/\sqrt{|T|}$.

[Figure: LDS vs. number of transactions. LDS, % (0 to 1.2) vs. number of transactions, millions (1, 10, 100), for 1-itemsets, 2-itemsets, and 3-itemsets; |t| = 5, ρ = 50%.]

Page 42: LDS vs. Breach Level

[Figure: LDS, % (0 to 2.5) vs. privacy breach level, % (30 to 90), for 1-itemsets, 2-itemsets, and 3-itemsets; |t| = 5, |T| = 5 M.]

Reminder: breach level is the limit on Pr[z ∈ t | A ⊆ t’].

Page 43: Real Datasets: soccer, mailorder

Soccer is the clickstream log of the WorldCup’98 web site, split into sessions of HTML requests.
– 11 K items (HTMLs), 6.5 M transactions

Mailorder is a purchase dataset from a certain on-line store.
– Products are replaced with their categories
– 96 items (categories), 2.9 M transactions

Page 44: Results

Breach level = 50%.

Soccer: s_min = 0.2% (0.07% for 3-itemsets)

Itemset Size | True Itemsets | True Positives | False Drops | False Positives
1            | 266           | 254            | 12          | 31
2            | 217           | 195            | 22          | 45
3            | 48            | 43             | 5           | 26

Mailorder: s_min = 0.2% (0.05% for 3-itemsets)

Itemset Size | True Itemsets | True Positives | False Drops | False Positives
1            | 65            | 65             | 0           | 0
2            | 228           | 212            | 16          | 28
3            | 22            | 18             | 4           | 5

Page 45: Summary

Can have our cake and mine it too!
Randomization is an interesting approach for building data mining models while preserving user privacy!!!

Y. Lindell, B. Pinkas. Privacy Preserving Data Mining. Crypto 2000.
S. Rizvi, J. Haritsa. Privacy-Preserving Association Rule Mining. VLDB 2002.
J. Vaidya, C.W. Clifton. Privacy Preserving Association Rule Mining in Vertically Partitioned Data. KDD 2002.

Page 46: The Hippocratic Oath

“What I may see or hear in the course of treatment or even outside of the treatment in regard to the life of men, which on no account [ought to be] spread abroad, I will keep to myself, holding such things shameful to be spoken about.”

– Hippocratic Oath, 8 (circa 400 BC)

Page 47: Hippocratic Databases

Founding tenet: Responsibility for the privacy of data they manage.

R. Agrawal, J. Kiernan, R. Srikant, Y. Xu. Hippocratic Databases. 28th Int'l Conf. on Very Large Databases (VLDB), August 2002.

Page 48: Approach

Derive founding principles from current privacy legislation.

Strawman Design

Page 49: Ten Principles of Hippocratic Databases

Collection Group
– Purpose Specification, Consent, Limited Collection

Use Group
– Limited Use, Limited Disclosure, Limited Retention, Accuracy

Security & Openness Group
– Safety, Openness, Compliance

Page 50: Collection Group

1. Purpose Specification
– For personal information stored in the database, the purposes for which the information has been collected shall be associated with that information.

2. Consent
– The purposes associated with personal information shall have consent of the donor (person whose information is being stored).

3. Limited Collection
– The information collected shall be limited to the minimum necessary for accomplishing the specified purposes.

Page 51: Use Group

4. Limited Use
– The database shall run only those queries that are consistent with the purposes for which the information has been collected.

5. Limited Disclosure
– Personal information shall not be communicated outside the database for purposes other than those for which there is consent from the donor of the information.

Page 52: Use Group (2)

6. Limited Retention
– Personal information shall be retained only as long as necessary for the fulfillment of the purposes for which it has been collected.

7. Accuracy
– Personal information stored in the database shall be accurate and up-to-date.

Page 53: Security & Openness Group

8. Safety
– Personal information shall be protected by security safeguards against theft and other misappropriations.

9. Openness
– A donor shall be able to access all information about the donor stored in the database.

10. Compliance
– A donor shall be able to verify compliance with the above principles. Similarly, the database shall be able to address a challenge concerning compliance.

Page 54: Strawman Architecture

[Figure: four groups of components around the Store: Privacy Policy, Data Collection, Queries, and Other.]

Page 55: Architecture: Policy

[Figure: components: Privacy Policy, Privacy Metadata Creator, Store, Privacy Metadata. Principles addressed: Limited Disclosure, Limited Retention.]

Privacy Metadata Creator: converts privacy policy into privacy metadata tables.

For each purpose & piece of information (attribute):
• External recipients
• Retention period
• Authorized users

Different designs possible.

Page 56: Privacy Policies Table

Purpose         | Table    | Attribute | External-recipients     | Authorized-users   | Retention
purchase        | customer | name      | {delivery, credit-card} | {shipping, charge} | 1 month
purchase        | customer | email     | empty                   | {shipping}         | 1 month
register        | customer | name      | empty                   | {registration}     | 3 years
register        | customer | email     | empty                   | {registration}     | 3 years
recommendations | order    | book      | empty                   | {mining}           | 10 years
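
One way to picture this table in code is as a list of rows plus a lookup. This is only a sketch of one possible in-memory representation; `PolicyRow`, `POLICIES`, and `authorized` are hypothetical names, and the talk itself notes that different designs are possible.

```python
from typing import FrozenSet, NamedTuple

class PolicyRow(NamedTuple):
    purpose: str
    table: str
    attribute: str
    external_recipients: FrozenSet[str]
    authorized_users: FrozenSet[str]
    retention: str

# Rows copied from the privacy policies table on the slide.
POLICIES = [
    PolicyRow("purchase", "customer", "name",
              frozenset({"delivery", "credit-card"}), frozenset({"shipping", "charge"}), "1 month"),
    PolicyRow("purchase", "customer", "email", frozenset(), frozenset({"shipping"}), "1 month"),
    PolicyRow("register", "customer", "name", frozenset(), frozenset({"registration"}), "3 years"),
    PolicyRow("register", "customer", "email", frozenset(), frozenset({"registration"}), "3 years"),
    PolicyRow("recommendations", "order", "book", frozenset(), frozenset({"mining"}), "10 years"),
]

def authorized(user, purpose, table, attribute):
    """True if `user` may access table.attribute for `purpose` under POLICIES."""
    return any(row.purpose == purpose and row.table == table
               and row.attribute == attribute and user in row.authorized_users
               for row in POLICIES)
```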

Page 57: Architecture: Data Collection

[Figure: components: Data Collection, Privacy Constraint Validator, Audit Info, Audit Trail, Store, Privacy Metadata. Principles addressed: Consent, Compliance.]

Privacy Constraint Validator: privacy policy compatible with user’s privacy preference?
Audit Trail: audit trail for compliance.

Page 58: Architecture: Data Collection

[Figure: components: Data Collection, Privacy Constraint Validator, Data Accuracy Analyzer, Record Access Control, Audit Info, Audit Trail, Store, Privacy Metadata. Principles addressed: Purpose Specification, Accuracy.]

Data Accuracy Analyzer: data cleansing, e.g., errors in address.
Record Access Control: associate set of purposes with each record.

Page 59: Architecture: Queries

[Figure: components: Queries, Attribute Access Control, Record Access Control, Store, Privacy Metadata. Principles addressed: Safety, Limited Use.]

1. Telemarketing cannot issue query tagged “charge”.
2. Query tagged “telemarketing” cannot see credit card info.
3. Telemarketing query only sees records that include “telemarketing” in set of purposes.
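
The three checks can be sketched as a single guarded query function. Everything below (the metadata dictionaries, the sample records, and `run_query`) is hypothetical and stands in for the privacy metadata tables and access-control components of the architecture.

```python
# Hypothetical privacy metadata; all names and values are illustrative.
USER_PURPOSES = {"telemarketer": {"telemarketing"}, "billing": {"charge"}}
PURPOSE_ATTRIBUTES = {
    "telemarketing": {"name", "phone"},
    "charge": {"name", "credit_card"},
}
RECORDS = [
    {"name": "Alice", "phone": "555-0100", "credit_card": "4111-0000",
     "purposes": {"charge", "telemarketing"}},
    {"name": "Bob", "phone": "555-0101", "credit_card": "4111-0001",
     "purposes": {"charge"}},
]

def run_query(user, purpose, attributes):
    # 1. The user may only issue queries tagged with purposes granted to that user.
    if purpose not in USER_PURPOSES.get(user, set()):
        raise PermissionError(f"{user} cannot issue queries tagged {purpose!r}")
    # 2. Attribute access control: the purpose limits which attributes are visible.
    denied = set(attributes) - PURPOSE_ATTRIBUTES.get(purpose, set())
    if denied:
        raise PermissionError(f"purpose {purpose!r} cannot see {sorted(denied)}")
    # 3. Record access control: only records whose purposes include the query's purpose.
    return [{a: r[a] for a in attributes} for r in RECORDS if purpose in r["purposes"]]
```

For example, `run_query("telemarketer", "telemarketing", ["name", "phone"])` returns only Alice's name and phone, because Bob's record does not list "telemarketing" among its purposes.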

Page 60: Architecture: Queries

[Figure: components: Queries, Attribute Access Control, Record Access Control, Query Intrusion Detector, Audit Info, Audit Trail, Store, Privacy Metadata. Principles addressed: Safety, Compliance.]

Query Intrusion Detector: example: a telemarketing query that asks for all phone numbers.

Audit Trail:
• Compliance
• Training data for query intrusion detector

Page 61: Architecture: Other

[Figure: components: Other, Data Retention Manager, Data Collection Analyzer, Encryption Support, Store, Privacy Metadata. Principles addressed: Limited Retention, Limited Collection, Safety.]

Data Retention Manager: delete items in accordance with privacy policy.
Data Collection Analyzer: analyze queries to identify unnecessary collection, retention & authorizations.
Encryption Support: additional security for sensitive data.
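
A minimal sketch of the check a data retention manager might run. It assumes a record may be deleted once the retention period of every purpose attached to it has passed, and it approximates "1 month" and "3 years" in days; both are assumptions, and `RETENTION` and `expired` are hypothetical names.

```python
from datetime import date, timedelta

# Retention periods per purpose, approximated in days (an assumption).
RETENTION = {
    "purchase": timedelta(days=30),
    "register": timedelta(days=3 * 365),
}

def expired(collected_on, purposes, today):
    """True once the retention period of every attached purpose has passed."""
    return all(collected_on + RETENTION[p] < today for p in purposes)
```

Note that a record with no purposes counts as expired under this rule, since nothing justifies keeping it.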

Page 62: Strawman Architecture

[Figure: full architecture around the Store (Privacy Metadata, Audit Trail):
– Privacy Policy: Privacy Metadata Creator
– Data Collection: Privacy Constraint Validator, Data Accuracy Analyzer, Audit Info
– Queries: Attribute Access Control, Record Access Control, Query Intrusion Detector, Audit Info
– Other: Data Retention Manager, Data Collection Analyzer, Encryption Support]

Page 63: Status

Prototyping core functionality of the design.
Nibbling at some of the open problems (see VLDB-2002 paper).

Page 64: Privacy-Preserving Synthetic Datasets for Data Mining Research

How to randomize to be able to build multiple types of models
How to handle combination of data types
How to handle rare events

[Figure: source data (Credit Agencies, Govt Records, Demographic, Birth, Marriage, Communications, Transactions, State, Local) feeding into Synthetic Data.]

Page 65: Network is the Database

What if private data never leaves a person’s data store?
– Computations travel to data

[Figure: two flows: Jane’s Data, Credit Application, Decision; and Jane’s Data, Approval Function, Result.]

Page 66: Decision-Making Across Private Data Repositories

Separate databases due to statutory, competitive, or security reasons.
– Selective, minimal sharing on need-to-know basis.

Example: Among those who took a particular drug, how many had an adverse reaction and their DNA contains a specific sequence?
– Researchers must not learn anything beyond counts.

Minimal Necessary Sharing

[Figure: R = {a, u, v, x}, S = {b, u, v, y}.
– R ∩ S = {u, v}: R must not know that S has b & y; S must not know that R has a & x.
– Count (R ∩ S): R & S do not learn anything except that the result is 2.]
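
One known way to compute such a count is with commutative encryption, where encrypting under two keys gives the same result in either order. The sketch below is a single-process simulation using modular exponentiation. The prime, the hashing step, and all function names are assumptions for illustration; the parameters are toy-sized and this is not a secure implementation, nor necessarily the protocol the talk has in mind.

```python
import hashlib
import math
import secrets

P = 2**127 - 1   # a Mersenne prime; toy-sized, not a secure parameter choice

def hash_to_group(value):
    """Map a value to an integer in [1, P-1]."""
    digest = hashlib.sha256(str(value).encode()).digest()
    return int.from_bytes(digest, "big") % (P - 1) + 1

def new_key():
    """Random exponent coprime to P-1, so that x -> x^e mod P is a bijection."""
    while True:
        e = secrets.randbelow(P - 3) + 2
        if math.gcd(e, P - 1) == 1:
            return e

def encrypt(x, key):
    return pow(x, key, P)      # (x^a)^b == (x^b)^a mod P, hence commutative

def intersection_size(r_values, s_values):
    key_r, key_s = new_key(), new_key()
    # Each party encrypts its own hashed values and sends them to the other.
    r_once = [encrypt(hash_to_group(v), key_r) for v in r_values]
    s_once = [encrypt(hash_to_group(v), key_s) for v in s_values]
    # Each party applies its own key to the other's ciphertexts.
    r_twice = {encrypt(c, key_s) for c in r_once}
    s_twice = {encrypt(c, key_r) for c in s_once}
    # Equal doubly encrypted values correspond to equal original values.
    return len(r_twice & s_twice)
```

In a real exchange each party would also reorder its ciphertexts before sending them, so that positions reveal nothing about which values matched.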

Page 67: Closing Thoughts

The right to privacy: the most cherished of human freedoms
-- Warren & Brandeis, 1890

Code is law … it is all a matter of code: the software and hardware that now rule
-- L. Lessig

We can architect computing systems to protect values we believe are fundamental, or we can architect them to allow those values to disappear.

What do we want to do as computer scientists?

Page 68: References

R. Agrawal, R. Srikant. Information Integration Across Autonomous Enterprises. ACM Int’l Conf. on Management of Data (SIGMOD), San Diego, California, June 2003.
R. Agrawal, J. Kiernan, R. Srikant, Y. Xu. An XPath Based Preference Language for P3P. 12th Int'l World Wide Web Conf. (WWW), Budapest, Hungary, May 2003.
R. Agrawal, J. Kiernan, R. Srikant, Y. Xu. Implementing P3P Using Database Technology. 19th Int'l Conf. on Data Engineering (ICDE), Bangalore, India, March 2003.
R. Agrawal, J. Kiernan, R. Srikant, Y. Xu. Server Centric P3P. W3C Workshop on the Future of P3P, Dulles, Virginia, Nov. 2002.
R. Agrawal, J. Kiernan, R. Srikant, Y. Xu. Hippocratic Databases. 28th Int'l Conf. on Very Large Databases (VLDB), Hong Kong, August 2002.
R. Agrawal, J. Kiernan. Watermarking Relational Databases. 28th Int'l Conf. on Very Large Databases (VLDB), Hong Kong, August 2002.
A. Evfimievski, R. Srikant, R. Agrawal, J. Gehrke. Mining Association Rules Over Privacy Preserving Data. 8th Int'l Conf. on Knowledge Discovery in Databases and Data Mining (KDD), Edmonton, Canada, July 2002.
R. Agrawal, R. Srikant. Privacy Preserving Data Mining. ACM Int’l Conf. on Management of Data (SIGMOD), Dallas, Texas, May 2000.

Page 69: New Challenges

General
– Language
– Efficiency

Use
– Limited Collection
– Limited Disclosure
– Limited Retention

Security and Openness
– Safety
– Openness
– Compliance

Page 70: Language

Need a language for privacy policies & user preferences.
P3P can be used as starting point.
– Developed primarily for web shopping.
– What about richer domains?

How do we balance expressibility and usability?
– Arrange concepts in hierarchy or subsumption relationship.

P3P recipients: Ours, Same, Delivery, Unrelated, Public

Purpose:
[Figure: hierarchy with contact at the top; email and phone below it; home and work at the bottom level.]

Page 71: Language (2)

How do we accommodate user negotiation models?
– User willing to disclose information only if fairly compensated.
– Value of privacy as coalitional game [KPR2001]

Page 72: Efficiency

How do we minimize the cost of privacy checking?
How do we incorporate purpose into database design and query optimization?

Tradeoffs between space & running time.
– Only tag records in customer table with purpose, not all records.
– But now need to do a join when scanning records in order table.

How does the work in secure databases on decomposition of multilevel relations into single-level relations [JS91] apply here?

Page 73: Limited Collection

How do we identify attributes that are collected but not used?
– Assets are only needed for mortgage when salary is below some threshold.

What’s the needed granularity for numeric attributes?
– Queries only ask “Salary > threshold” for rent application.

How do we generate minimal queries?
– Redundancy may be hidden in application code.

Page 74: Limited Disclosure

Can the user dynamically determine the set of recipients?

Example: Alice wants to add EasyCredit to set of recipients in EquiRate’s database.

Digital signatures.

Page 75: Limited Retention

Completely forgetting some information is non-trivial.

How do we delete a record from the logs and checkpoints, without affecting recovery?

How do we continue to support historical analysis and statistical queries without incurring privacy breaches?

Page 76: Safety

Encryption provides additional layer of security.

How do we index encrypted data?
How do we run queries against encrypted data? [SWP00], [HILM02]

Page 77: Openness

A donor shall be able to access all information about the donor stored in the database.

How does the database check Alice is really Alice and not somebody else?
– Princeton admissions office broke into Yale’s admissions using applicant’s social security number and birth date.

How does Alice find out what databases have information about her?
– Symmetrically private information retrieval [GIKM98].

Page 78: Compliance

Universal Logging
– Can we provide each user whose data is accessed with a log of that access, along with the query reading the data?
– Use intermediaries who aggregate and analyze logs for many users.

Tracking Privacy Breaches
– Insert “fingerprint” records with emails, telephone numbers, and credit card numbers.
– Some data may be more valuable for spammers or credit card theft. How do we identify categories to do stratified fingerprinting rather than randomly inserting records?