Logic Programming for Data Tagging
Transcript of Logic Programming for Data Tagging
Logic Programming for Data Tagging
with Alexandra Wood Berkman
Micah Altman MIT
Kevin Wang Harvard Undergrad
Aaron BembenekHarvard Grad & Privacy Tools Summer Intern 2015
Stephen Chong Privacy Tools Project NSF Site Visit
Monday October 19
Stephen Chong, Harvard University.
Logic Programming for Data Tagging
3
Datalog implementation
Formal legal model for tag recommendation
and license generation
Optimization of DataTags
Questionnaires
Stephen Chong, Harvard University.
This Work
Tag Recommendations and Licenses
4
Statutory text
Analysis in legal memos
Formal model of regulatory requirements
Data Tags
Tag characterization
Social Scientist
Privacy Legislation
Use and share data
Permitted/denied actions
Stephen Chong, Harvard University.
Tag Recommendations and LicensesMotivation and context
Formalize aspects of privacy legislation Using a logic programming language
Answer whether legislation/best practice permits or denies specific actions on data sets
Expert-system-like ability
Explore legislation e.g., find conditions where best practice contradictory
Combines computer science (formal modeling), law (legal research & analysis), social science (survey design), information science (taxonomies)
5
Stephen Chong, Harvard University.
This work
System design
6
Data OwnerData User
Transformations
Access Control
Data deposit
Deposit license
Recommended privacy attributes Recommended Tags Recommended license text
Local practices
Question textLicense textLogic rules
CMR formalization
FERPA formalization
Questions
Answers
Logic program
Stephen Chong, Harvard University.
Formal model: Actions
7
Deposit(dd, ds, r, cs)
Accept(r, ds, dd, cs)
Release(r, ds, du, dd, cs)...
dd : Data depositor
ds : Dataset du : Data user
r : Repository
cs : Condition set (provides further details about action)
Stephen Chong, Harvard University.
Permitted or DeniedActions can be permitted or denied
Or neither permitted or denied
E.g.,
8
Permitted(leg, a)
Denied(ferpa, Release(harvardDataverse, cs152grades-2015sp, [email protected], [email protected], [dataverseClickthrough]))
leg : Legislation
Denied(leg, a)
a : Action
Stephen Chong, Harvard University.
Example formalization
9
Let dd be the data depositor Let du be the data user Let ds be the data set Let r be the repository Let cs be a set of conditions IF CMR:depositorInScope(dd, ds) AND CMR:identifiable(ds) AND NOT (CMR:secure(r) AND CMR:isAcceptableConditionsForRelease(cs))THEN DENIED(Release(r, ds, du, dd, cs))
Let l be a license Let cs be a set of conditions IF License(l) ∈ cs AND licenseImplies(l, CMR:TransmissionEncrypted)THEN CMR:isAcceptableConditionsForRelease(cs)
Stephen Chong, Harvard University.
Formalization process
10
Define%Use%Cases%In%Scope%
Data%Deposit% Reten2on% Transforma2on% Dissemina2on%
Parse%Use%Case%
Components%
Actors:%data%controller,%
repository,%user%
Ac2ons:%accept%data,%store,%release,…%
En22es:%Data%set,%Record,…%
Legal%Review%of%Law%
Iden2fy%Restric2ons%on%Use%Cases%%
Iden2fy%Restric2ons%on%Actors,%Ac2ons,%
En22es%
Expert%Coding%of%Use%Case%Ac2ons%
Map%rules%permiGng%and%restric2ng%ac2on%
to%lawHspecific%characteris2cs%
Map%lawHspecific%characteris2cs%to%
license%text%affirma2ons%
Map%lawHspecific%characters2cs%to%general%proper2es%
Stephen Chong, Harvard University.
DataTagsPermitted and denied actions are the interface between DataTags and legislation
May require more powerful language than Prolog...
12
Let dd be the data depositor Let du be the data user Let ds be the data set Let r be the repository Let t be the data tag IF isDataTag(t, r, ds) AND FOR ALL condition sets cs PERMITTED(Release(r, ds, du, dd, cs)) IMPLIES conditionsRequire(cs, ReidentificationProhibited) THEN atLeast(t, Yellow)
Stephen Chong, Harvard University.
Logic Programming for Data Tagging
13
Datalog implementation
Formal legal model for tag recommendation
and license generation
Optimization of DataTags
Questionnaires
Stephen Chong, Harvard University.
DataTags Questionnaire in DatalogWhen accepting dataset, ask depositor series of questions to determine DataTag Currently: nice domain specific language
But imperative with explicit control flow (i.e., gotos)
Goal: express more declaratively using Datalog Separate questions from control flow
Facilitate composition of questionnaires
Re-run old answers when questionnaire changes
...
14
Stephen Chong, Harvard University.
“Optimize” question orderGiven declarative questionnaire, what is "best" order to ask questions?
Fewest questions to reach decision? Ask questions from general to specific? Ask related questions at same time?
Assume some cost function for the question order Characterize as a game
Player asks question, opponent gives answer Player's goal: reach decision with lowest cost Determine strategy with lowest (expected) cost
Game tree too big to explore exhaustively E.g., with n questions, 3 answers per question, there are n! × 3n paths/final states But analysis of Datalog program can significantly reduce search
15
Stephen Chong, Harvard University.
Logic Programming for Data Tagging
16
Datalog implementation
Formal legal model for tag recommendation
and license generation
Optimization of DataTags
Questionnaires
Stephen Chong, Harvard University.
Our very own Datalog...Developed our own Datalog implementation
Can extend with language features
More flexible interface/efficient interaction
Make use of modern concurrent hardware
Will be used in Harvard undergrad PL course
...
Will be released open source
17
Stephen Chong, Harvard University.
Current stateCurrent state:
Six evaluation engines Top-down, bottom up, concurrent bottom up, ...
Exploring different concurrent techniques to improve scalability
Preliminary results: 1.2–5.5× speedup over XSB Prolog on OpenRuleBench transitive closure tests
Implemented hypotheticals
Graphical user interface Suitable for use by undergrad class!
18
Stephen Chong, Harvard University.
Moving forwardFormal legal model
License generation (from required conditions)
Review/independent validation of rules and license text
Independent validation of formalization process
Engagement with practitioners IRBs, state and local govt. agencies, educational data controllers, ...
Questionnaire representation and optimization Datalog
Release and use
Develop right logical extension for, e.g., connecting to DataTags
19