Logic Programming for Data Tagging

Stephen Chong

with Alexandra Wood (Berkman), Micah Altman (MIT), Kevin Wang (Harvard Undergrad), Aaron Bembenek (Harvard Grad & Privacy Tools Summer Intern 2015)

Privacy Tools Project NSF Site Visit

Monday October 19


Stephen Chong, Harvard University.

Logic Programming for Data Tagging


Datalog implementation

Formal legal model for tag recommendation and license generation

Optimization of DataTags Questionnaires

This Work

Tag Recommendations and Licenses

[Diagram. Labels: Privacy Legislation; Statutory text; Analysis in legal memos; Formal model of regulatory requirements; Permitted/denied actions; Data Tags; Tag characterization; Social Scientist; Use and share data]

Tag Recommendations and Licenses

Motivation and context

Formalize aspects of privacy legislation using a logic programming language

Answer whether legislation/best practice permits or denies specific actions on data sets

Expert-system-like ability

Explore legislation, e.g., find conditions where best practice is contradictory

Combines computer science (formal modeling), law (legal research & analysis), social science (survey design), information science (taxonomies)

This work

System design

[System design diagram. Labels: Data Owner; Data User; Data deposit; Deposit license; Transformations; Access Control; Questions; Answers; Logic program; CMR formalization; FERPA formalization; Local practices; Question text; License text; Logic rules; Recommended privacy attributes; Recommended Tags; Recommended license text]

Formal model: Actions

Deposit(dd, ds, r, cs)

Accept(r, ds, dd, cs)

Release(r, ds, du, dd, cs)

...

dd : Data depositor

ds : Dataset

du : Data user

r : Repository

cs : Condition set (provides further details about action)
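The action signatures on the "Formal model: Actions" slide can be mirrored as plain immutable records. The following is only a sketch in Python, not the project's own encoding: the class layout, the use of strings for actors and datasets, and every example value are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet

# Illustrative only: the slides give the signatures
# Deposit(dd, ds, r, cs), Accept(r, ds, dd, cs), Release(r, ds, du, dd, cs);
# representing each argument as a string is an assumption.

@dataclass(frozen=True)
class Deposit:
    dd: str             # data depositor
    ds: str             # dataset
    r: str              # repository
    cs: FrozenSet[str]  # condition set (further details about the action)

@dataclass(frozen=True)
class Accept:
    r: str
    ds: str
    dd: str
    cs: FrozenSet[str]

@dataclass(frozen=True)
class Release:
    r: str
    ds: str
    du: str             # data user
    dd: str
    cs: FrozenSet[str]

# Hypothetical example values, not taken from the slides.
release = Release(
    r="exampleRepository",
    ds="exampleDataset",
    du="exampleUser",
    dd="exampleDepositor",
    cs=frozenset({"exampleClickthrough"}),
)
```

Frozen dataclasses are hashable, so an action built this way can be stored in a set of facts or used as a dictionary key.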

Permitted or Denied

Actions can be permitted or denied

Or neither permitted nor denied

Permitted(leg, a)

Denied(leg, a)

leg : Legislation

a : Action

E.g.,

Denied(ferpa, Release(harvardDataverse, cs152grades-2015sp, [email protected], [email protected], [dataverseClickthrough]))
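Because an action may be permitted, denied, or neither, a lookup over Permitted(leg, a) and Denied(leg, a) facts needs more than a boolean. A minimal Python sketch, assuming facts are stored as (legislation, action) pairs; the function name `status`, the tuple encoding of actions, the extra "conflict" outcome, and all example values are hypothetical.

```python
from typing import Hashable, Set, Tuple

Fact = Tuple[str, Hashable]  # (legislation, action)

def status(leg: str, action: Hashable,
           permitted: Set[Fact], denied: Set[Fact]) -> str:
    """Classify an action under one piece of legislation."""
    is_permitted = (leg, action) in permitted
    is_denied = (leg, action) in denied
    if is_permitted and is_denied:
        return "conflict"   # assumption: flag contradictory facts explicitly
    if is_permitted:
        return "permitted"
    if is_denied:
        return "denied"
    return "neither"

# Hypothetical action, encoded as a tuple for hashability.
example_release = ("Release", "exampleRepository", "exampleDataset",
                   "exampleUser", "exampleDepositor", ("exampleClickthrough",))

denied_facts = {("ferpa", example_release)}
permitted_facts: Set[Fact] = set()

print(status("ferpa", example_release, permitted_facts, denied_facts))
print(status("cmr", example_release, permitted_facts, denied_facts))
```

The first call reports "denied"; the second reports "neither", since no fact mentions the action under the second legislation.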

Example formalization

Let dd be the data depositor
Let du be the data user
Let ds be the data set
Let r be the repository
Let cs be a set of conditions
IF CMR:depositorInScope(dd, ds)
   AND CMR:identifiable(ds)
   AND NOT (CMR:secure(r) AND CMR:isAcceptableConditionsForRelease(cs))
THEN DENIED(Release(r, ds, du, dd, cs))

Let l be a license
Let cs be a set of conditions
IF License(l) ∈ cs
   AND licenseImplies(l, CMR:TransmissionEncrypted)
THEN CMR:isAcceptableConditionsForRelease(cs)
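The two rules on the "Example formalization" slide can be read as boolean functions over a small fact base. The sketch below is one possible Python reading, not the project's Datalog encoding. It assumes that conditions are (kind, value) pairs such as ("License", l), that base facts are stored as sets, and that all concrete names in the example fact base are invented.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple

Condition = Tuple[str, str]  # assumed encoding, e.g. ("License", "someLicense")

@dataclass(frozen=True)
class Facts:
    depositor_in_scope: FrozenSet[Tuple[str, str]]  # (dd, ds) pairs
    identifiable: FrozenSet[str]                    # datasets
    secure: FrozenSet[str]                          # repositories
    license_implies: FrozenSet[Tuple[str, str]]     # (license, property) pairs

def acceptable_conditions_for_release(cs: FrozenSet[Condition], facts: Facts) -> bool:
    # IF License(l) in cs AND licenseImplies(l, TransmissionEncrypted)
    return any(
        kind == "License" and (value, "TransmissionEncrypted") in facts.license_implies
        for kind, value in cs
    )

def release_denied(r: str, ds: str, du: str, dd: str,
                   cs: FrozenSet[Condition], facts: Facts) -> bool:
    # du is unused, as in the rule on the slide.
    return (
        (dd, ds) in facts.depositor_in_scope
        and ds in facts.identifiable
        and not (r in facts.secure and acceptable_conditions_for_release(cs, facts))
    )

# Hypothetical fact base for illustration.
facts = Facts(
    depositor_in_scope=frozenset({("exampleDepositor", "exampleDataset")}),
    identifiable=frozenset({"exampleDataset"}),
    secure=frozenset({"secureRepo"}),
    license_implies=frozenset({("encryptedLicense", "TransmissionEncrypted")}),
)
```

A result of False only means that this particular rule does not deny the release; it does not by itself establish that the release is permitted.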

Formalization process

Define Use Cases In Scope
Data Deposit, Retention, Transformation, Dissemination

Parse Use Case Components
Actors: data controller, repository, user
Actions: accept data, store, release, …
Entities: Data set, Record, …

Legal Review of Law
Identify Restrictions on Use Cases
Identify Restrictions on Actors, Actions, Entities

Expert Coding of Use Case Actions
Map rules permitting and restricting action to law-specific characteristics
Map law-specific characteristics to license text affirmations
Map law-specific characteristics to general properties

Demo


DataTags

Permitted and denied actions are the interface between DataTags and legislation

May require more powerful language than Prolog...

Let dd be the data depositor
Let du be the data user
Let ds be the data set
Let r be the repository
Let t be the data tag
IF isDataTag(t, r, ds)
   AND FOR ALL condition sets cs
       PERMITTED(Release(r, ds, du, dd, cs)) IMPLIES conditionsRequire(cs, ReidentificationProhibited)
THEN atLeast(t, Yellow)
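The DataTags rule quantifies over all condition sets, which is the part that may call for more than plain Prolog. One way to get a feel for the rule is to brute-force it over a small finite universe of conditions. The Python sketch below does that under several assumptions: the universe is finite and tiny, conditionsRequire(cs, c) is approximated by membership of c in cs, and the two example policies are invented.

```python
from itertools import chain, combinations
from typing import Callable, FrozenSet, Iterable, Iterator

def condition_sets(universe: Iterable[str]) -> Iterator[FrozenSet[str]]:
    """Enumerate every subset of a finite universe of conditions."""
    items = list(universe)
    subsets = chain.from_iterable(
        combinations(items, k) for k in range(len(items) + 1)
    )
    return (frozenset(s) for s in subsets)

def at_least_yellow(universe: Iterable[str],
                    release_permitted: Callable[[FrozenSet[str]], bool]) -> bool:
    """FOR ALL cs: PERMITTED(Release(..., cs)) IMPLIES cs requires ReidentificationProhibited."""
    return all(
        (not release_permitted(cs)) or ("ReidentificationProhibited" in cs)
        for cs in condition_sets(universe)
    )

UNIVERSE = ["ReidentificationProhibited", "TransmissionEncrypted"]

# Hypothetical policies standing in for the legislation model.
def strict_policy(cs: FrozenSet[str]) -> bool:
    return "ReidentificationProhibited" in cs

def lenient_policy(cs: FrozenSet[str]) -> bool:
    return "TransmissionEncrypted" in cs

print(at_least_yellow(UNIVERSE, strict_policy))   # True
print(at_least_yellow(UNIVERSE, lenient_policy))  # False
```

Enumeration grows as 2^n in the number of conditions, so this only illustrates the meaning of the quantifier; it is not a proposal for how the logic language should evaluate it.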


DataTags Questionnaire in Datalog

When accepting dataset, ask depositor series of questions to determine DataTag

Currently: nice domain specific language

But imperative with explicit control flow (i.e., gotos)

Goal: express more declaratively using Datalog

Separate questions from control flow

Facilitate composition of questionnaires

Re-run old answers when questionnaire changes

...
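To illustrate what separating questions from control flow could look like, the sketch below keeps question text and decision rules as data and derives both the tag and the still-relevant questions from them. It is written in Python rather than Datalog, and everything in it is assumed for illustration: the question identifiers, the tag names `tagLow`, `tagMid`, `tagHigh`, and both rule sets are invented and are not the DataTags questionnaire.

```python
from typing import Dict, List, Set, Tuple

Answers = Dict[str, str]
Rule = Tuple[Dict[str, str], str]  # (required answers, resulting tag)

# Question text lives apart from the rules (all entries hypothetical).
QUESTIONS: Dict[str, str] = {
    "containsPersonalData": "Does the dataset contain personal data?",
    "identifiable": "Are individuals identifiable?",
    "consentObtained": "Was consent obtained for sharing?",
}

RULES_V1: List[Rule] = [
    ({"containsPersonalData": "no"}, "tagLow"),
    ({"containsPersonalData": "yes", "identifiable": "no"}, "tagMid"),
    ({"containsPersonalData": "yes", "identifiable": "yes"}, "tagHigh"),
]

# A revised questionnaire that refines the last case with a new question.
RULES_V2: List[Rule] = RULES_V1[:2] + [
    ({"containsPersonalData": "yes", "identifiable": "yes", "consentObtained": "yes"}, "tagMid"),
    ({"containsPersonalData": "yes", "identifiable": "yes", "consentObtained": "no"}, "tagHigh"),
]

def derived_tags(answers: Answers, rules: List[Rule]) -> Set[str]:
    """Tags whose rule is fully satisfied by the answers given so far."""
    return {
        tag for required, tag in rules
        if all(answers.get(q) == a for q, a in required.items())
    }

def open_questions(answers: Answers, rules: List[Rule]) -> Set[str]:
    """Unanswered questions mentioned by rules still consistent with the answers."""
    needed: Set[str] = set()
    for required, _tag in rules:
        if all(answers.get(q, a) == a for q, a in required.items()):
            needed |= {q for q in required if q not in answers}
    return needed

old_answers = {"containsPersonalData": "yes", "identifiable": "yes"}
print(derived_tags(old_answers, RULES_V1))    # {'tagHigh'}
print(open_questions(old_answers, RULES_V2))  # {'consentObtained'}
```

Re-running the stored answers against the revised rules shows which decisions still hold and which new questions must be asked, with no question order written into the rules themselves.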

“Optimize” question order

Given declarative questionnaire, what is "best" order to ask questions?

Fewest questions to reach decision?

Ask questions from general to specific?

Ask related questions at same time?

Assume some cost function for the question order

Characterize as a game

Player asks question, opponent gives answer

Player's goal: reach decision with lowest cost

Determine strategy with lowest (expected) cost

Game tree too big to explore exhaustively

E.g., with n questions, 3 answers per question, there are n! × 3^n paths/final states

But analysis of Datalog program can significantly reduce search
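The size of the game tree and the idea of a lowest-expected-cost strategy can both be made concrete on a toy questionnaire. The Python sketch below assumes a cost of one per question asked and a uniform distribution over the three answers, neither of which comes from the slides; the two questions and the `decided` predicate are likewise invented.

```python
from fractions import Fraction
from math import factorial
from typing import Dict, Optional, Tuple

def path_count(n: int, answers_per_question: int = 3) -> int:
    """Question orderings times answer assignments: n! * 3^n for 3 answers."""
    return factorial(n) * answers_per_question ** n

QUESTIONS = ("q1", "q2")
OPTIONS = ("yes", "no", "unknown")

def decided(answers: Dict[str, str]) -> bool:
    # Hypothetical questionnaire: "no" to q1 settles the decision at once;
    # otherwise both questions are needed.
    return answers.get("q1") == "no" or all(q in answers for q in QUESTIONS)

def best_strategy(answers: Dict[str, str]) -> Tuple[Fraction, Optional[str]]:
    """Return (expected number of further questions, best next question)."""
    if decided(answers):
        return Fraction(0), None
    best_cost: Optional[Fraction] = None
    best_question: Optional[str] = None
    for q in QUESTIONS:
        if q in answers:
            continue
        # Average over the opponent's possible answers (uniform, by assumption).
        total = sum((best_strategy({**answers, q: a})[0] for a in OPTIONS), Fraction(0))
        cost = 1 + total / len(OPTIONS)
        if best_cost is None or cost < best_cost:
            best_cost, best_question = cost, q
    return best_cost, best_question

print(path_count(3))      # 162
print(best_strategy({}))  # (Fraction(5, 3), 'q1')
```

Asking q1 first costs 5/3 questions on average, against 2 when starting with q2. The exhaustive recursion is exactly what stops scaling as n grows, which is where analysis of the Datalog program would have to prune the search.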


Our very own Datalog...

Developed our own Datalog implementation

Can extend with language features

More flexible interface/efficient interaction

Make use of modern concurrent hardware

Will be used in Harvard undergrad PL course

...

Will be released open source


Current state

Six evaluation engines: top-down, bottom-up, concurrent bottom-up, ...

Exploring different concurrent techniques to improve scalability

Preliminary results: 1.2–5.5× speedup over XSB Prolog on OpenRuleBench transitive closure tests

Implemented hypotheticals

Graphical user interface

Suitable for use by undergrad class!
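For readers unfamiliar with bottom-up evaluation, the transitive closure program used in such benchmarks is the standard small example. The Python sketch below shows generic semi-naive bottom-up evaluation of that program. It is not the engine described on the slide; the function name and the encoding of facts as pairs of strings are assumptions.

```python
from typing import Set, Tuple

Edge = Tuple[str, str]

def transitive_closure(edges: Set[Edge]) -> Set[Edge]:
    """Semi-naive bottom-up evaluation of:
         path(X, Y) :- edge(X, Y).
         path(X, Z) :- path(X, Y), edge(Y, Z).
    """
    path = set(edges)
    delta = set(edges)  # facts derived in the previous round
    while delta:
        # Join only the newly derived facts against edge/2.
        new = {(x, z) for (x, y) in delta for (y2, z) in edges if y == y2}
        delta = new - path
        path |= delta
    return path

print(sorted(transitive_closure({("a", "b"), ("b", "c"), ("c", "d")})))
```

Each round joins only the facts found in the previous round, so nothing is re-derived from scratch, and the loop stops at the fixpoint when a round adds no new facts.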

Moving forward

Formal legal model

License generation (from required conditions)

Review/independent validation of rules and license text

Independent validation of formalization process

Engagement with practitioners: IRBs, state and local govt. agencies, educational data controllers, ...

Questionnaire representation and optimization

Datalog

Release and use

Develop right logical extension for, e.g., connecting to DataTags

Questions?
