Implementing Database Access Control Policy from Unconstrained Natural Language Text
LAS Research Presentation
John Slankas, June 24th, 2015

(Relation extraction slides are from Dan Jurafsky's NLP course on Coursera.)
Research Path & Publications

[Figure: timeline (2012–2015) of publications across the four research areas – feasibility, classification, access control extraction, and database model extraction]

POLICY 2012, NaturaLiSE 2013, ICSE Doctoral Symposium 2013, PASSAT 2013, ASE Science Journal 2013, ACSAC 2014, ESEM 2014 (2nd author), RE 2014 (3rd author), ESEM 2015 (to be submitted)
Agenda
• Motivation
• Research Goal
• Background and Related Work – focus on Relation Extraction
• Solution – Role Extraction and Database Enforcement
• Studies
  • Classification
  • Access Control Extraction
  • Database Model Extraction & End-to-End Implementation
• Limitations
• Future Work
• Research Goal Evaluation & Contributions
2015 – The Year of the Healthcare Hack [Peterson 2015]

Two major breaches:
• Anthem – 80 million records
• Premera – 11 million records

Experts fault Anthem for lack of robust access control [Bennett 2015] [Husain 2015] [Redhead 2015] [Westin 2015]
Research Goal

Improve security and compliance by ensuring access control rules (ACRs) explicitly and implicitly defined within unconstrained natural language product artifacts are appropriately enforced within a system's relational database.
Background

Access Control Rules (ACRs)
Regulate who can perform actions on resources: (subject, action, object)

Database Model Elements (DMEs)
Organization of stored data
• Entities: a "thing" in the real world
• Attributes: a property that describes an entity
• Relationships: an association between two entities
Extracting relations from text

Company report: "International Business Machines Corporation (IBM or the company) was incorporated in the State of New York on June 16, 1911, as the Computing-Tabulating-Recording Co. (C-T-R)…"

Extracted complex relation (Company-Founding):
Company: IBM; Location: New York; Date: June 16, 1911; Original-Name: Computing-Tabulating-Recording Co.

But we will focus on the simpler task of extracting relation triples:
Founding-year(IBM, 1911)
Founding-location(IBM, New York)
Extracting Relation Triples from Text

"The Leland Stanford Junior University, commonly referred to as Stanford University or Stanford, is an American private research university located in Stanford, California … near Palo Alto, California… Leland Stanford…founded the university in 1891"

Stanford EQ Leland Stanford Junior University
Stanford LOC-IN California
Stanford IS-A research university
Stanford LOC-NEAR Palo Alto
Stanford FOUNDED-IN 1891
Stanford FOUNDER Leland Stanford
Why Relation Extraction?

• Create new structured knowledge bases, useful for any application
• Augment current knowledge bases: adding words to the WordNet thesaurus, facts to Freebase or DBpedia
• Support question answering: "The granddaughter of which actor starred in the movie 'E.T.'?" → (acted-in ?x "E.T.") (is-a ?y actor) (granddaughter-of ?x ?y)

But which relations should we extract?
Automated Content Extraction (ACE)

17 relations from the 2008 "Relation Extraction Task":
• PHYSICAL: Located, Near
• PART-WHOLE: Geographical, Subsidiary
• PERSON-SOCIAL: Business, Family, Lasting Personal
• ORG AFFILIATION: Employment, Membership, Founder, Ownership, Student-Alum, Sports-Affiliation, Investor
• GENERAL AFFILIATION: Citizen-Resident-Ethnicity-Religion, Org-Location-Origin
• ARTIFACT: User-Owner-Inventor-Manufacturer
Automated Content Extraction (ACE)

• Physical-Located (PER-GPE): "He was in Tennessee"
• Part-Whole-Subsidiary (ORG-ORG): "XYZ, the parent company of ABC"
• Person-Social-Family (PER-PER): "John's wife Yoko"
• Org-AFF-Founder (PER-ORG): "Steve Jobs, co-founder of Apple…"
Databases of Wikipedia Relations

Relations extracted from the Wikipedia infobox:
Stanford state California
Stanford motto "Die Luft der Freiheit weht"
…

Relation databases that draw from Wikipedia use Resource Description Framework (RDF) triples: subject predicate object
Golden Gate Park location San Francisco
dbpedia:Golden_Gate_Park dbpedia-owl:location dbpedia:San_Francisco

DBpedia: 1 billion RDF triples, 385 million from English Wikipedia

Frequent Freebase relations: people/person/nationality, location/location/contains, people/person/profession, people/person/place-of-birth, biology/organism_higher_classification, film/film/genre
Ontological relations

• IS-A (hypernym): subsumption between classes
  Giraffe IS-A ruminant IS-A ungulate IS-A mammal IS-A vertebrate IS-A animal…
• Instance-of: relation between individual and class
  San Francisco instance-of city

Examples from the WordNet thesaurus
How to build relation extractors
1. Hand-written patterns
2. Supervised machine learning
3. Semi-supervised and unsupervised
   • Bootstrapping (using seeds)
   • Distant supervision
   • Unsupervised learning from the web
Rules for extracting the IS-A relation

Early intuition from Hearst (1992):
"Agar is a substance prepared from a mixture of red algae, such as Gelidium, for laboratory or industrial use"

What does Gelidium mean? How do you know?
Hearst's Patterns for extracting IS-A relations

(Hearst, 1992): Automatic Acquisition of Hyponyms
• "Y such as X ((, X)* (, and|or) X)"
• "such Y as X"
• "X or other Y"
• "X and other Y"
• "Y including X"
• "Y, especially X"
Hearst's Patterns for extracting IS-A relations

Hearst pattern | Example occurrence
X and other Y | ...temples, treasuries, and other important civic buildings.
X or other Y | Bruises, wounds, broken bones or other injuries...
Y such as X | The bow lute, such as the Bambara ndang...
Such Y as X | ...such authors as Herrick, Goldsmith, and Shakespeare.
Y including X | ...common-law countries, including Canada and England...
Y, especially X | European countries, especially France, England, and Spain...
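For illustration only, here is a minimal sketch of applying two of these patterns with plain regular expressions. It is not how production systems do it (they match over part-of-speech-tagged noun phrases), and the class name, pattern simplifications, and example sentences are mine.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Minimal sketch: apply the "X and other Y" and "Y such as X" Hearst patterns
// with regular expressions over single words (real systems match noun phrases).
public class HearstSketch {
    private static final Pattern X_AND_OTHER_Y =
        Pattern.compile("(\\w+),? and other (\\w+)");
    private static final Pattern Y_SUCH_AS_X =
        Pattern.compile("(\\w+),? such as (\\w+)");

    public static void main(String[] args) {
        List<String> sentences = List.of(
            "Bruises, wounds, broken bones and other injuries were treated.",
            "He worked on red algae, such as Gelidium, for years.");
        for (String s : sentences) {
            Matcher m = X_AND_OTHER_Y.matcher(s);
            while (m.find()) {   // group 1 = hyponym X, group 2 = hypernym Y
                System.out.println("IS-A(" + m.group(1) + ", " + m.group(2) + ")");
            }
            m = Y_SUCH_AS_X.matcher(s);
            while (m.find()) {   // group 2 = hyponym X, group 1 = hypernym Y
                System.out.println("IS-A(" + m.group(2) + ", " + m.group(1) + ")");
            }
        }
    }
}
```

Running this prints IS-A(bones, injuries) and IS-A(Gelidium, algae), matching the example occurrences in the table above.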
Hand-built patterns for relations

Plus:
• Human patterns tend to be high-precision
• Can be tailored to specific domains

Minus:
• Human patterns are often low-recall
• A lot of work to think of all possible patterns!
• Don't want to have to do this for every relation!
• We'd like better accuracy
Supervised machine learning for relations

• Choose a set of relations we'd like to extract
• Choose a set of relevant named entities
• Find and label data
  • Choose a representative corpus
  • Label the named entities in the corpus
  • Hand-label the relations between these entities
  • Break into training, development, and test sets
• Train a classifier on the training set
How to do classification in supervised relation extraction

1. Find all pairs of named entities (usually in the same sentence)
2. Decide whether the two entities are related
3. If yes, classify the relation

Why the extra step?
• Faster classification training by eliminating most pairs
• Can use distinct feature sets appropriate for each task
Relation Extraction

Classify the relation between two entities in a sentence:
"American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said."

Candidate labels: SUBSIDIARY, FAMILY, EMPLOYMENT, FOUNDER, CITIZEN, INVENTOR, …, or NIL
Word Features for Relation Extraction

Example (M1 = "American Airlines", M2 = "Tim Wagner"):
"American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said"

• Headwords of M1 and M2, and their combination: Airlines; Wagner; Airlines-Wagner
• Bag of words and bigrams in M1 and M2: {American, Airlines, Tim, Wagner, American Airlines, Tim Wagner}
• Words or bigrams in particular positions left and right of M1/M2: M2 –1: spokesman; M2 +1: said
• Bag of words or bigrams between the two entities: {a, AMR, of, immediately, matched, move, spokesman, the, unit}
Named Entity Type and Mention Level Features for Relation Extraction

Same example sentence (M1 = "American Airlines", M2 = "Tim Wagner"):
• Named-entity types: M1: ORG; M2: PERSON
• Concatenation of the two named-entity types: ORG-PERSON
• Entity level of M1 and M2 (NAME, NOMINAL, PRONOUN):
  M1: NAME ["it" or "he" would be PRONOUN]
  M2: NAME ["the company" would be NOMINAL]
Parse Features for Relation Extraction

Same example sentence (M1 = "American Airlines", M2 = "Tim Wagner"):
• Base syntactic chunk sequence from one mention to the other: NP NP PP VP NP NP
• Constituent path through the tree from one mention to the other: NP NP S S NP
• Dependency path: Airlines matched Wagner said
Gazetteer and trigger word features for relation extraction

• Trigger list for family: kinship terms such as parent, wife, husband, grandparent, etc. [from WordNet]
• Gazetteer: lists of useful geographic or geopolitical words
  • Country name lists
  • Other sub-entities
Classifiers for supervised methods

Now you can use any classifier you like: MaxEnt, Naïve Bayes, SVM, ...

Train it on the training set, tune on the dev set, test on the test set.
Evaluation of Supervised Relation Extraction

Compute P/R/F1 for each relation:

P = (# of correctly extracted relations) / (total # of extracted relations)
R = (# of correctly extracted relations) / (total # of gold relations)
F1 = 2PR / (P + R)
Summary: Supervised Relation Extraction

+ Can get high accuracies with enough hand-labeled training data, if the test set is similar enough to the training set
- Labeling a large training set is expensive
- Supervised models are brittle and don't generalize well to different genres
Selected Related Work

Access Control Extraction
• Requirements-based Access Control Analysis and Policy Specification [He 2009]
• Automated Extraction of Security Policies from Natural-Language Software Documents [Xiao 2012]

Database Model Extraction
• English Sentence Structure and Entity-Relationship Diagrams [Chen 1983]
• Heuristics-based Entity-Relationship Modelling through NLP [Omar 2004]
• Conceptual Modeling of Natural Language Functional Requirements [Sagar 2014]
Role Extraction and Database Enforcement

Inputs: text documents, database design, domain knowledge
Outputs: generated SQL commands for access control, completeness and conflict report, traceability report

Process:
1) Parse natural language product artifacts
2) Classify sentence
3) Extract access control elements
4) Extract database model elements
5) Map data model to physical database schema
6) Implement access control
Step 1: Parse Natural Language Product Artifacts

Generate an intermediate representation from the text.

Example: "A nurse can order a lab procedure for a patient."

Named entities:
A – action
R – resource
S – subject

Parts of speech:
NN – noun
VB – verb

Relationships:
dobj – direct object
nn – noun compound modifier
nsubj – nominal subject
prep_for – prepositional modifier (for)
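As a concrete illustration, the sketch below produces this intermediate representation with the Stanford parser (one of the parsers listed under "Natural Language Parsers" in the appendix). This is a minimal sketch, not the dissertation's tool: the class name is mine, and the collapsed-dependency annotation is what yields relations such as prep_for.

```java
import java.util.Properties;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.semgraph.SemanticGraph;
import edu.stanford.nlp.semgraph.SemanticGraphCoreAnnotations;
import edu.stanford.nlp.util.CoreMap;

// Minimal sketch: parse one sentence into POS tags, a constituency parse,
// and typed dependencies with Stanford CoreNLP.
public class ParseSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma, parse");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        Annotation document =
            new Annotation("A nurse can order a lab procedure for a patient.");
        pipeline.annotate(document);

        for (CoreMap sentence : document.get(CoreAnnotations.SentencesAnnotation.class)) {
            // Collapsed dependencies fold prepositions into relations like prep_for
            SemanticGraph deps = sentence.get(
                SemanticGraphCoreAnnotations.CollapsedCCProcessedDependenciesAnnotation.class);
            System.out.println(deps.toList());
        }
    }
}
```

The printed dependency list matches the "NLP Outputs" example in the appendix (nsubj(order, nurse), dobj(order, procedure), and so on).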
Step 2: Classify Sentence

Performs two classifications on each sentence:
1. Does the sentence contain ACRs?
2. Does the sentence contain DMEs?

Example 1: "A nurse can order a lab procedure for a patient."
ACRs – Yes
DMEs – Yes

Example 2: "Lab procedures have a date-ordered, lab-type, and current status."
ACRs – No
DMEs – Yes
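For illustration only, a minimal k-NN sketch of these two binary decisions. The similarity function is stubbed as Jaccard word overlap; the actual work uses the graph-based sentence-similarity measure described in the appendix and an ensemble of classifiers (Study 1). All names here are hypothetical.

```java
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Minimal sketch of Step 2's two binary decisions using a k-NN vote.
public class SentenceClassifierSketch {
    record Labeled(String sentence, boolean hasACR, boolean hasDME) {}

    // Jaccard word overlap; a stand-in for the graph-based similarity measure.
    static double similarity(String a, String b) {
        Set<String> sa = new HashSet<>(List.of(a.toLowerCase().split("\\s+")));
        Set<String> sb = new HashSet<>(List.of(b.toLowerCase().split("\\s+")));
        Set<String> union = new HashSet<>(sa);
        union.addAll(sb);
        sa.retainAll(sb);                       // sa is now the intersection
        return union.isEmpty() ? 0.0 : (double) sa.size() / union.size();
    }

    static boolean knnVote(List<Labeled> train, String s, int k, boolean forACR) {
        List<Labeled> nearest = train.stream()
            .sorted(Comparator.comparingDouble(
                (Labeled l) -> -similarity(s, l.sentence())))
            .limit(k)
            .toList();
        long yes = nearest.stream()
            .filter(l -> forACR ? l.hasACR() : l.hasDME())
            .count();
        return 2 * yes > k;                     // majority vote
    }

    public static void main(String[] args) {
        List<Labeled> train = List.of(
            new Labeled("A nurse can order a lab procedure for a patient.", true, true),
            new Labeled("Lab procedures have a date-ordered, lab-type, and current status.", false, true),
            new Labeled("The system shall respond within two seconds.", false, false));
        String s = "A doctor can order a test for a patient.";
        System.out.println("ACRs? " + knnVote(train, s, 1, true));   // true
        System.out.println("DMEs? " + knnVote(train, s, 1, false));  // true
    }
}
```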
Step 3: Semantic Relation Extraction

[Figure: bootstrapping loop for pattern-based extraction – seed patterns are generated from a specific-action template (a VB-A verb with nsubj/dobj links to subject NN-S* and resource NN-R* nodes); patterns are applied to the text, and matched subjects and resources extend the known subjects & resources; a subject & resource search drives pattern extraction and transformation; new patterns are classified and injected into the pattern set; matched patterns yield access control rules]
Step 3: Extract Access Control Elements

Use semantic relations to extract information.

[Figure: dependency graph of "A nurse can order a lab procedure for a patient" – order (VB-A) with aux "can" (MD), nsubj "nurse" (NN-S), dobj "lab procedure" (NN-R), and prep_for "patient" (NN-R)]

Semantic relations:
(order, nurse, lab procedure)
(order_for, nurse, patient)

Relational patterns:
order – nsubj – nurse, – dobj – lab procedure
order – nsubj – nurse, – prep_for – patient

Access control rules:
(nurse, order, lab procedure, create)
(nurse, order_for, patient, read)
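A minimal sketch of how such a relational pattern can be read off the typed-dependency graph from Step 1. It assumes a single verb root and skips folding noun-compound modifiers (nn) into the resource ("lab procedure"); it is an illustration, not the bootstrapping process above.

```java
import edu.stanford.nlp.ling.IndexedWord;
import edu.stanford.nlp.semgraph.SemanticGraph;
import edu.stanford.nlp.semgraph.SemanticGraphEdge;

// Minimal sketch: read an (action, subject, resource) triple off a verb root.
// `deps` is the SemanticGraph produced by the Step 1 parse sketch.
public class TripleSketch {
    static String extract(SemanticGraph deps) {
        IndexedWord action = deps.getFirstRoot();          // e.g., "order"
        String subject = null, resource = null;
        for (SemanticGraphEdge edge : deps.outgoingEdgeIterable(action)) {
            String rel = edge.getRelation().getShortName();
            if (rel.equals("nsubj")) subject = edge.getDependent().word();
            if (rel.equals("dobj"))  resource = edge.getDependent().word();
        }
        // nn modifiers would still need to be folded in to get "lab procedure"
        return "(" + action.word() + ", " + subject + ", " + resource + ")";
    }
}
```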
Step 4: Extract Database Model Elements

Use semantic relations to extract information.

[Figure: same dependency graph as in Step 3]

Semantic relations:
(order, nurse, lab procedure)
(order_object_for, lab procedure, patient)

Semantic relational patterns:
order – nsubj – nurse, – dobj – lab procedure
order – dobj – lab procedure, – prep_for – patient

Database elements:
Entities: lab procedure, patient
Relationships: nurse orders lab procedure; lab procedure for patient
Step 5: Map Data Model to Physical Database Schema

• Merge ACRs and database model elements
• Map subjects to roles
• Map objects to tables

Access control rules:
(nurse, order, lab procedure, create)
(nurse, order_for, patient, read)

Database elements:
Entities: lab procedure, patient
Relationships: nurse orders lab procedure; lab procedure for patient

Physical database schema:
lab_procedure_tbl, patient_tbl, lab_procedure_patient_tbl; role: nurse_rl

Merged ACRs:
(nurse, order, lab procedure, create)
(nurse, order_for, patient, read)
(nurse, order, lab procedure_patient, create)
(nurse, order_for, lab procedure_patient, read)

Database access rules:
(nurse, lab procedure, create)
(nurse, patient, read)
(nurse, lab procedure_patient, create/read)
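The naming convention implied by the example (subjects map to roles with an _rl suffix, entities to tables with a _tbl suffix, relationships to join tables) can be sketched as follows; the convention and helper names are assumptions read off this slide, not the tool's actual mapping logic.

```java
// Sketch of the naming convention implied by the example above.
public class SchemaMappingSketch {
    static String toTable(String entity) {
        return entity.trim().toLowerCase().replace(' ', '_') + "_tbl";
    }
    static String toRole(String subject) {
        return subject.trim().toLowerCase().replace(' ', '_') + "_rl";
    }
    static String toJoinTable(String e1, String e2) {
        // Relationships become join tables named after both entities
        return toTable(e1).replace("_tbl", "") + "_" + toTable(e2);
    }

    public static void main(String[] args) {
        System.out.println(toTable("lab procedure"));                 // lab_procedure_tbl
        System.out.println(toRole("nurse"));                          // nurse_rl
        System.out.println(toJoinTable("lab procedure", "patient"));  // lab_procedure_patient_tbl
    }
}
```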
Step 6: Implement Access Control

• Perform sanity checks
  • Conflict detection
  • Unmapped subjects and resources
• Generate SQL commands:

create role nurse_rl;
grant insert on lab_procedure_tbl to nurse_rl;
grant select on patient_tbl to nurse_rl;
grant insert, select on lab_procedure_patient_tbl to nurse_rl;
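A sketch of how the GRANT statements above could be generated from database access rules of the form (role, table, permissions); the rule record and the create → INSERT / read → SELECT mapping are simplifications of the tuples shown in Step 5.

```java
import java.util.List;
import java.util.Map;

// Sketch: emit the SQL shown above from (role, table, permissions) rules.
public class GrantGeneratorSketch {
    record DbRule(String role, String table, List<String> permissions) {}

    // Abstract permissions map onto SQL privileges
    private static final Map<String, String> SQL_PERMS =
        Map.of("create", "insert", "read", "select");

    public static void main(String[] args) {
        List<DbRule> rules = List.of(
            new DbRule("nurse_rl", "lab_procedure_tbl", List.of("create")),
            new DbRule("nurse_rl", "patient_tbl", List.of("read")),
            new DbRule("nurse_rl", "lab_procedure_patient_tbl",
                       List.of("create", "read")));

        System.out.println("create role nurse_rl;");
        for (DbRule r : rules) {
            String perms = String.join(", ",
                r.permissions().stream().map(SQL_PERMS::get).toList());
            System.out.printf("grant %s on %s to %s;%n",
                              perms, r.table(), r.role());
        }
    }
}
```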
Process Challenges

• Ambiguity
  • Pronouns
  • Missing elements
  • "Generic" words (e.g., list, item, data)
  • Synonyms
• Negativity
• Schema mismatches
  • Names
  • Cardinality
Study 1: Classification Study [NaturaLiSE 2013]

Research the ability to classify sentences.

Why?
• Determine what needs to be processed further
• Prevent false positives

Focus
• Processing activity needs
• Determine appropriate sentence representation(s)
• Classifier and feature performance

Study documents: 11 healthcare-related documents
Study 1: What Classification Algorithm to Use?

Classifying non-functional requirements (NFR Locator):

Classifier | P | R | F1
Weighted Random | .047 | .060 | .053
50% Random | .044 | .502 | .081
Naïve Bayes | .227 | .347 | .274
SVM | .728 | .544 | .623
k-NN | .691 | .456 | .549
Study 1: What Classification Algorithm to Use?

Security requirements: k-NN with similarity check

Similarity Ratio | F1 | % Classified
0.5 | .85 | 56%
0.6 | .82 | 63%
0.7 | .78 | 74%
0.8 | .75 | 86%
0.9 | .71 | 96%
1.0 | .70 | 96%
∞ | .63 | 100%

Conclusion: use an ensemble-based classifier
Study 2: Access Control Extraction [PASSAT 2013] [ASE Science Journal 2013] [ACSAC 2014]

Research the ability to identify and extract access control rules.

Why?
• Determine the access control to implement

Focus
• Bootstrap knowledge to find ACRs
• Extend the pattern set while preventing false positives

Study documents:

Document | Domain | Document Type | # Sentences | # ACR Sentences | # ACRs | Fleiss' Kappa
iTrust | Healthcare | Use Case | 1160 | 550 | 2274 | 0.58
iTrust for Text2Policy | Healthcare | Use Case | 471 | 418 | 1070 | 0.73
IBM Course Mgmt | Education | Use Case | 401 | 169 | 375 | 0.82
CyberChair | Conf. Mgmt | Seminar Paper | 303 | 139 | 386 | 0.71
Collected ACP Docs | Multiple | Sentences | 142 | 114 | 258 | n/a
Study 2: What Properties do ACR Sentences Have?

Property | iTrust | iTrust_t2p | IBM CM | CyberChair | Collected
Text2Policy pattern – modal verb | 210 | 130 | 46 | 71 | 93
Text2Policy pattern – passive voice w/ to-infinitive | 66 | 21 | 10 | 39 | 9
Text2Policy pattern – access expression | 32 | 7 | 5 | 1 | 18
Text2Policy pattern – ability expression | 45 | 21 | 14 | 11 | 3
Sentences with multiple types of ACRs | 383 | 146 | 77 | 105 | 36
Patterns appearing once or twice | 680 | 173 | 162 | 184 | 97
ACRs with ambiguous subjects (e.g., "system", "user") | 193 | 119 | 139 | 1 | 13
ACRs with blank subjects | 557 | 206 | 29 | 187 | 5
ACRs with pronouns as subjects | 109 | 28 | 5 | 11 | 11
ACRs with ambiguous objects (e.g., entry, list, name) | 422 | 228 | 45 | 47 | 34
Total number of ACR sentences | 550 | 418 | 169 | 139 | 114
Total number of ACRs | 2274 | 1070 | 375 | 386 | 258
Study 2: Identifying ACR Sentences

Document | Precision | Recall | F1
iTrust for Text2Policy | .96 | .99 | .98
iTrust | .90 | .86 | .88
IBM Course Management | .83 | .92 | .87
CyberChair | .63 | .64 | .64
Collected ACP | .83 | .96 | .89
All documents, 10-fold | .81 | .84 | .83
Study 2: Identifying ACR Sentences without Training Sets

[Figure: classification performance (F1) by completion percentage]
Study 2: Extracting ACRs

Document | Precision | Recall | F1
iTrust for Text2Policy | .80 | .75 | .77
iTrust for ACRE | .75 | .60 | .67
IBM Course Management | .81 | .62 | .70
CyberChair | .75 | .30 | .43
Collected ACP | .68 | .18 | .29
Study 3: Database Model Extraction [ESEM 2015 (to be submitted)]

Research the ability to extract the database model and implement the process from start to finish.

Why?
• Need to map ACRs to the environment

Challenges
• Patterns
• Completeness

Case study: Open Conference System
Study 3: Classification Results

Does a sentence have ACRs and/or DMEs?

Configuration | Precision | Recall | F1
OCS (train using CyberChair), ACR | .82 | .29 | .42
CyberChair (train using OCS), ACR | .75 | .61 | .67
OCS, 10-fold self-validation, ACR | .81 | .78 | .79
OCS (train using CyberChair), DME | .82 | .29 | .42
CyberChair (train using OCS), DME | .75 | .61 | .67
OCS, 10-fold self-validation, DME | .83 | .78 | .79
Study 3: Extracting DMEs

Configuration | Precision | Recall | F1
Perfect knowledge from ACRs | 1.00 | .89 | .94
Results from ACR process | 1.00 | .81 | .90
Study 3: Database Design Extraction

Number of | System | Oracle | Process
ACRs | 52 | 730 | 686
resolved subjects | – | 524 | 686
resolved objects | – | 39 | 35
merged rules | – | 272 | 481
discovered roles | 7 | 21 | 1
discovered entities | 52 | 223 | 213
Limitations

• Text-based process
  • Conditional access
  • Rule-based resolution
• Only considered access at the table level
• Mapping discovered roles and entities to the actual database was manually performed
• Only examined one system within a given problem domain for end-to-end validation
• System implementation may not match documentation
  • Different functionality
  • Effective dating/status in place of deletes/updates
Future Work

• Access control rules
  • Temporal orderings
  • Conditions / constraints
• Database model elements
  • Field types
  • Values / ranges
• Human-computer interaction
Research Goal Evaluation

Improve security and compliance by:
• identifying and extracting access control rules (ACRs)
• identifying and extracting database model elements (DMEs)
• implementing defined access control rules in a system's database

Confirmation
• Identify ACR sentences: .83 F1
• Extract ACRs: .29 to .77 F1
• Identify DME sentences: .79 F1
• Extract DMEs: .90 F1
• Generated number of ACRs: 272
Contributions

• Approach and supporting tool*
• Sentence similarity algorithm
• Bootstrapping algorithms
• Labeled corpora*
• Pattern distributions

* https://github.com/RealsearchGroup/REDE
References

[Bennett 2015] Bennett, Cory. "Weak Login Security at Heart of Anthem Breach." http://thehill.com/policy/cybersecurity/232158-weak-login-security-at-heart-of-anthem-breach. Accessed 3/15/2015.
[Chen 1983] Chen, Peter. "English Sentence Structure and Entity-Relationship Diagrams." Information Sciences 29: 127-149, 1983.
[He 2009] He, Q. and Antón, A.I. "Requirements-based Access Control Analysis and Policy Specification (ReCAPS)." Information and Software Technology, vol. 51, no. 6, pp. 993-1009, 2009.
[Husain 2015] Husain, Azam. "What the Anthem Breach Teaches Us About Access Control." http://www.healthitoutcomes.com/doc/what-the-anthem-breach-teaches-us-about-access-control-0001. Accessed 3/15/2015.
[Omar 2004] Omar, Nazlia. Heuristics-Based Entity-Relationship Modelling through Natural Language Processing. PhD Dissertation, University of Ulster, 2004.
[Peterson 2015] Peterson, Andrea. "2015 is already the year of the health-care hack – and it's only going to get worse." Washington Post, Washington D.C., 3/20/2015.
[Redhead 2015] Redhead, C. Stephen. "Anthem Data Breach: How Safe Is Health Information Under HIPAA?" http://fas.org/sgp/crs/misc/IN10235.pdf. Congressional Research Service Report. Accessed 3/16/2015.
[Sagar 2014] Sagar, Vidhu Bhala R. Vidya and Abirami, S. "Conceptual Modeling of Natural Language Functional Requirements." Journal of Systems and Software, vol. 88, pp. 25-41, 2014.
[Westin 2015] Westin, Ken. "How the Anthem Breach Could Have Happened." http://www.tripwire.com/state-of-security/incident-detection/how-the-anthem-breach-could-have-happened/. Accessed 3/15/2015.
[Xiao 2012] Xiao, X., Paradkar, A., Thummalapenta, S. and Xie, T. "Automated Extraction of Security Policies from Natural-Language Software Documents." International Symposium on the Foundations of Software Engineering (FSE), Raleigh, North Carolina, USA, 2012.
References

[Slankas 2015] Slankas, John and Williams, Laurie. "Relation Extraction for Inferring Database Models from Natural Language Artifacts." 2015 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM 2015), to be submitted.
[Slankas 2014] Slankas, John, Xiao, Xusheng, Williams, Laurie, and Xie, Tao. "Relation Extraction for Inferring Access Control Rules from Natural Language Artifacts." 2014 Annual Computer Security Applications Conference (ACSAC 2014), New Orleans, LA.
[Riaz 2014b] Riaz, Maria, Slankas, John, King, Jason, and Williams, Laurie. "Using Templates to Elicit Implied Security Requirements from Functional Requirements – A Controlled Experiment." ACM/IEEE 8th International Symposium on Empirical Software Engineering and Measurement (ESEM 2014), Torino, Italy, September 18-19, 2014.
[Riaz 2014a] Riaz, Maria, King, Jason, Slankas, John, and Williams, Laurie. "Hidden in Plain Sight: Automatically Identifying Security Requirements from Natural Language Artifacts." 2014 Requirements Engineering Conference (RE 2014), Karlskrona, Sweden, August 25-29, 2014.
[Slankas 2013d] Slankas, John and Williams, Laurie. "Access Control Policy Identification and Extraction from Project Documentation." Academy of Science and Engineering Science Journal, vol. 2, no. 3, pp. 145-159, 2013.
[Slankas 2013c] Slankas, John and Williams, Laurie. "Access Control Policy Extraction from Unconstrained Natural Language Text." 2013 ASE/IEEE International Conference on Privacy, Security, Risk, and Trust (PASSAT 2013), Washington D.C., USA, September 8-14, 2013.
[Slankas 2013b] Slankas, John and Williams, Laurie. "Automated Extraction of Non-functional Requirements in Available Documentation." 1st International Workshop on Natural Language Analysis in Software Engineering (NaturaLiSE 2013), San Francisco, CA.
[Slankas 2013a] Slankas, John. "Implementing Database Access Control Policy from Unconstrained Natural Language Text." 35th International Conference on Software Engineering – Doctoral Symposium (ICSE DS 2013), San Francisco, CA.
[Slankas 2012] Slankas, John and Williams, Laurie. "Classifying Natural Language Sentences for Policy." IEEE International Symposium on Policies for Distributed Systems and Networks (POLICY 2012).
Additional Information
Other Solutions to Inappropriate Data Access

• Auditing
• Intrusion detection
• Manually establish access control
  • Completeness
  • Correctness
  • Effort
Additional Information
Machine Learning Background

• Combines computer science and statistics
• Supervised vs. unsupervised
• Sample algorithms
  • k-nearest neighbor (k-NN)
  • Naïve Bayes
  • Decision trees
  • Regression
  • k-means clustering
Additional Information
Semantic Relation Related Work

1992 – Hearst, Automatic Acquisition of Hyponyms from Large Text Corpora
2004 – Snow et al., Learning Syntactic Patterns for Automatic Hypernym Discovery
2005 – Zhou et al., Exploring Various Knowledge in Relation Extraction
Additional Information
Natural Language Parsers

• Apache OpenNLP: http://opennlp.apache.org/
• Berkeley Parser: http://nlp.cs.berkeley.edu/
• BLLIP (Charniak-Johnson): http://bllip.cs.brown.edu/
• GATE: https://gate.ac.uk
• MALLET: http://mallet.cs.umass.edu/
• Python Natural Language Toolkit: http://www.nltk.org/
• Stanford Natural Language Parser: http://nlp.stanford.edu/

Criteria:
• Performs well
• Open-source, maintained, well-documented
• Java
Additional Information
NLP Outputs

POS Tagging:
The/DT nurse/NN can/MD order/VB a/DT lab/NN procedure/NN for/IN a/DT patient/NN ./.
Parse: (ROOT
(S
(NP (DT The) (NN nurse))
(VP (MD can)
(VP (VB order)
(NP (DT a) (NN lab) (NN procedure))
(PP (IN for)
(NP (DT a) (NN patient)))))
(. .)))
Typed Dependency: det(nurse-2, The-1)
nsubj(order-4, nurse-2)
aux(order-4, can-3)
root(ROOT-0, order-4)
det(procedure-7, a-5)
nn(procedure-7, lab-6)
dobj(order-4, procedure-7)
prep(order-4, for-8)
det(patient-10, a-9)
pobj(for-8, patient-10)
Additional Information
Precision, Recall, F1 Measure

Precision (P) is the proportion of predicted access control statements that are correct: P = TP / (TP + FP)
Recall (R) is the proportion of actual access control statements found: R = TP / (TP + FN)
F1 measure is the harmonic mean of P and R: F1 = 2 × P × R / (P + R)
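Worked example (hypothetical counts, chosen to match the iTrust ACR-extraction row in Study 2): with TP = 60, FP = 20, and FN = 40, P = 60/80 = 0.75, R = 60/100 = 0.60, and F1 = (2 × 0.75 × 0.60) / (0.75 + 0.60) = 0.90 / 1.35 ≈ 0.67.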
Confusion matrix:

                   Expected: Yes     Expected: No
Predicted: Yes     True Positive     False Positive
Predicted: No      False Negative    True Negative
Additional Information
Inter-rater Agreement (Fleiss' Kappa)

How well do multiple raters agree beyond what's possible by chance?

κ = (P̄ − P̄e) / (1 − P̄e)

The degree of agreement attained above chance divided by the degree of agreement possible above chance.
Fleiss' Kappa | Agreement Interpretation
≤ 0 | Less than chance
0.01 – 0.20 | Slight
0.21 – 0.40 | Fair
0.41 – 0.60 | Moderate
0.61 – 0.80 | Substantial
0.81 – 0.99 | Almost perfect
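For reference, a sketch of the standard Fleiss' kappa computation (not code from the dissertation); the rating matrix in main is hypothetical: 4 items rated by 3 raters into 2 categories (e.g., ACR sentence yes/no).

```java
// Sketch: Fleiss' kappa for N items rated by n raters into k categories.
// ratings[i][j] = number of raters who put item i into category j.
public class FleissKappaSketch {
    static double kappa(int[][] ratings) {
        int N = ratings.length;          // items
        int k = ratings[0].length;       // categories
        int n = 0;                       // raters per item
        for (int j = 0; j < k; j++) n += ratings[0][j];

        double[] pj = new double[k];     // category proportions
        double pBar = 0.0;               // mean per-item agreement
        for (int[] row : ratings) {
            int sumSq = 0;
            for (int j = 0; j < k; j++) {
                sumSq += row[j] * row[j];
                pj[j] += row[j];
            }
            pBar += (sumSq - n) / (double) (n * (n - 1));
        }
        pBar /= N;

        double pe = 0.0;                 // chance agreement
        for (int j = 0; j < k; j++) {
            pj[j] /= (double) (N * n);
            pe += pj[j] * pj[j];
        }
        return (pBar - pe) / (1 - pe);
    }

    public static void main(String[] args) {
        int[][] ratings = { {3, 0}, {2, 1}, {0, 3}, {1, 2} };
        System.out.printf("kappa = %.3f%n", kappa(ratings));  // 0.333
    }
}
```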
Step 3: Semantic Relations

Use semantic relation extraction to extract access control elements from natural language text.

Semantic relation: the underlying meaning between two concepts.

Examples:
• Hypernymy (is-a): "users, such as nurses, authenticate …"
• Meronymy (whole-part): "a patient's vital signs"
• Verb phrases: "customers rent cars"
Additional Information
Access Control Rule Representation

A(s, a, r, n, l, c, H, p)
s – vertices composing the subject
a – vertices composing the action
r – vertices composing the resource
n – vertex representing negativity
l – vertex representing limitation to a specific role
c – vertices providing context to the access control policy
H – subgraph required to connect all previous vertices
p – set of permissions associated with the current policy

A({nurse}, {order}, {lab procedure}, –, –, –, (V: nurse, order, lab procedure; E: (order, nurse, nsubj), (order, lab procedure, dobj)), create)
A({nurse}, {order}, {patient}, –, –, –, (V: nurse, order, patient; E: (order, nurse, nsubj), (order, patient, prep_for)), read)
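One way to carry this tuple through a pipeline, sketched as a Java record. Field names mirror the notation above; the types (strings and string lists, with H reduced to edge strings) are assumptions for illustration, not the dissertation's data structures.

```java
import java.util.List;
import java.util.Set;

// Sketch: the access control rule tuple A(s, a, r, n, l, c, H, p) as a record.
public record AccessControlRule(
        List<String> subject,      // s
        List<String> action,       // a
        List<String> resource,     // r
        boolean negative,          // n
        String limitedToRole,      // l (null when absent)
        List<String> context,      // c
        List<String> graphEdges,   // H, e.g., "(order, nurse, nsubj)"
        Set<String> permissions) { // p

    // The first example instance from the slide above
    public static AccessControlRule example() {
        return new AccessControlRule(
            List.of("nurse"), List.of("order"), List.of("lab procedure"),
            false, null, List.of(),
            List.of("(order, nurse, nsubj)", "(order, lab procedure, dobj)"),
            Set.of("create"));
    }
}
```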
Additional Information
Database Model Element Representation

Entities: De({e}, H)
Attributes of entities: Da({e}, {a}, H)
Relationships: Dr({r}, {e1}, {e2}, H)
Additional Information
Step 4: Database Model Patterns

[Figure: dependency-graph templates for database model elements –
Relationship (association): Entity_1 (NN-E) –nsubj– verb (VB-R) –dobj– Entity_2 (NN-E);
Relationship (aggregation/composition): whole (NN-E) –nsubj– "have" (VB-R) –dobj– part (NN-E);
Relationship (inheritance): SpecificEntity (NN-E) –nsubj– "be" (VB-R) –prep_of– GeneralEntity (NN-E);
Entity-attributes: Entity (NN-E) –poss– Attribute (NN-A), or Attribute (NN-A) –prep_of– Entity (NN-E);
Entity: Entity (NN-E) appearing as nsubj, dobj, or prep_% of a verb (%VB)]
Additional Information
Step 4: Extract Database Design

[Figure: pattern-based extraction pipeline – patterns are generated from templates (wildcard patterns) and combined with manually identified patterns and patterns transformed from the extracted access control rules; the resulting pattern set drives a pattern search over the parsed text; matches are classified, additional patterns are injected, and known entities and relationships accumulate as database design elements are extracted]
Additional Information
Negativity

• Specific adjectives (unable)
• Adverbs (not, never)
• Determiners (no, zero, neither)
• Nouns (none, nothing)
• Negative verbs (stop, prohibit, forbid)
• Negative prefixes for verbs
Additional Information
Study 1: Research Questions

1. What document types contain NFRs in each of the 14 different categories?
2. What characteristics, such as keywords or entities, do sentences assigned to each NFR category have in common?
3. What machine learning classification algorithm has the best performance to identify NFRs?
4. What sentence characteristics affect classifier performance?
Additional Information
Study 1: Non-functional Requirement Categories

• Started from Cleland-Huang, et al.
• Combined performance and scalability
• Separated access control and audit from security
• Added privacy, recoverability, reliability, and other

Categories: Access Control, Audit, Availability, Legal, Look & Feel, Maintenance, Operational, Privacy, Recoverability, Performance & Scalability, Reliability, Security, Usability, Other

J. Cleland-Huang, R. Settimi, X. Zou, and P. Solc, "Automated Classification of Non-functional Requirements," Requirements Engineering, vol. 12, no. 2, pp. 103-120, Mar. 2007.
Lawrence Chung's NFRs

accessibility, accountability, accuracy, adaptability, agility, auditability, availability, buffer space performance, capability, capacity, clarity, code-space performance, cohesiveness, commonality, communication cost, communication time, compatibility, completeness, component integration time, composability, comprehensibility, conceptuality, conciseness, confidentiality, configurability, consistency, coordination cost, coordination time, correctness, cost, coupling, customer evaluation time, customer loyalty, customizability, data-space performance, decomposability, degradation of service, dependability, development cost, development time, distributivity, diversity, domain analysis cost, domain analysis time, efficiency, elasticity, enhanceability, evolvability, execution cost, extensibility, external consistency, fault-tolerance, feasibility, flexibility, formality, generality, guidance, hardware cost, impact analyzability, independence, informativeness, inspection cost, inspection time, integrity, inter-operability, internal consistency, intuitiveness, learnability, main-memory performance, maintainability, maintenance cost, maintenance time, maturity, mean performance, measurability, mobility, modifiability, modularity, naturalness, nomadicity, observability, off-peak-period performance, operability, operating cost, peak-period performance, performability, performance, planning cost, planning time, plasticity, portability, precision, predictability, process management time, productivity, project stability, project tracking cost, promptness, prototyping cost, prototyping time, reconfigurability, recoverability, recovery, reengineering cost, reliability, repeatability, replaceability, replicability, response time, responsiveness, retirement cost, reusability, risk analysis cost, risk analysis time, robustness, safety, scalability, secondary storage performance, security, sensitivity, similarity, simplicity, software cost, software production time, space boundedness, space performance, specificity, stability, standardizability, subjectivity, supportability, surety, survivability, susceptibility, sustainability, testability, testing time, throughput, time performance, timeliness, tolerance, traceability, trainability, transferability, transparency, understandability, uniform performance, uniformity, usability, user-friendliness, validity, variability, verifiability, versatility, visibility, wrappability
Additional Information
Study 1: Documents

Document | Document Type | Size | AC | AU | AV | LG | LF | MT | OP | PR | PS | RC | RL | SC | US | OT | FN | NA
CCHIT Ambulatory Requirements | Requirement | 306 | 12 | 27 | 1 | 2 | 0 | 10 | 0 | 0 | 1 | 5 | 2 | 28 | 4 | 8 | 228 | 6
iTrust | Requirement, Use Case | 1165 | 439 | 44 | 0 | 2 | 2 | 18 | 2 | 9 | 0 | 9 | 9 | 55 | 2 | 0 | 734 | 376
PromiseData | Requirement | 792 | 164 | 20 | 36 | 10 | 50 | 26 | 89 | 7 | 75 | 4 | 12 | 71 | 101 | 19 | 340 | 0
Open EMR Install Manual | Installation Manual | 225 | 3 | 0 | 0 | 0 | 0 | 0 | 5 | 1 | 0 | 6 | 1 | 25 | 0 | 0 | 2 | 184
Open EMR User Manual | User Manual | 473 | 169 | 0 | 0 | 0 | 14 | 0 | 0 | 0 | 0 | 0 | 0 | 8 | 4 | 0 | 286 | 95
NC Public Health DUA | DUA | 62 | 1 | 0 | 0 | 20 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 41
US Medicare/Medicaid DUA | DUA | 140 | 1 | 0 | 0 | 26 | 0 | 0 | 0 | 17 | 0 | 0 | 0 | 0 | 0 | 5 | 2 | 108
California Correctional Health Care | RFP | 1893 | 94 | 120 | 9 | 85 | 0 | 133 | 94 | 52 | 13 | 16 | 13 | 193 | 14 | 38 | 987 | 409
Los Angeles County EHR | RFP | 1268 | 58 | 37 | 8 | 3 | 2 | 28 | 19 | 3 | 11 | 8 | 13 | 108 | 21 | 10 | 639 | 380
HIPAA Combined Rule | CFR | 2642 | 28 | 8 | 3 | 0 | 0 | 78 | 0 | 213 | 0 | 9 | 0 | 41 | 1 | 0 | 317 | 2018
Meaningful Use Criteria | CFR | 1435 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 8 | 0 | 0 | 116 | 1311
Health IT Standards | CFR | 1475 | 10 | 20 | 0 | 0 | 0 | 119 | 0 | 1 | 0 | 2 | 2 | 71 | 1 | 2 | 164 | 1146
Total | | 11876 | 979 | 276 | 57 | 152 | 68 | 413 | 207 | 300 | 100 | 50 | 43 | 563 | 148 | 82 | 3568 | 6076
Study 1/RQ1: What document types contain what categories of NFRs?

• All evaluated documents contained NFRs
• RFPs had a wide variety of NFRs, except look and feel
• DUAs contained high frequencies of legal and privacy NFRs
• Access control and/or security NFRs appeared in all of the documents
• The low frequency of functional requirements and NFRs within CFRs exemplifies why tool support is critical to efficiently extract requirements from those documents
Additional Information
Study 2: Research Questions

1. What patterns exist among sentences with access control rules?
2. How frequently do different forms of ambiguity occur in sentences with access control rules?
3. How effectively does our process detect sentences with access control rules?
4. How effectively can the subject, action, and resource elements of ACRs be extracted?
Additional Information
Study 2: Investigated Documents

Document | Domain | # Sentences | # ACR Sentences | # ACRs | Fleiss' Kappa
iTrust | Healthcare | 1160 | 550 | 2274 | 0.58
iTrust for Text2Policy | Healthcare | 471 | 418 | 1070 | 0.73
IBM Course Management | Education | 401 | 169 | 375 | 0.82
CyberChair | Conf. Mgmt | 303 | 139 | 386 | 0.71
Collected ACP Documents | Multiple | 142 | 114 | 258 | n/a
Additional Information
Study 2: ACR Patterns

Top ACR patterns:
Pattern | Num. of Occurrences
(VB root(NN nsubj)(NN dobj)) | 465 (14.1%)
(VB root(NN nsubjpass)) | 122 (3.7%)
(VB root(NN nsubj)(NN prep)) | 116 (3.5%)
(VB root(NN dobj)) | 72 (2.2%)
(VB root(NN prep_%)) | 63 (1.9%)
Additional Information
Study 2: Ambiguity

Ambiguity | Occurrence % in ACR Sentences
Pronouns | 3.2%
"System" / "user" | 11.0%
No explicit subject | 17.3%
Other ambiguous terms | 21.5%
Missing objects | 0.2%

Ambiguous terms: "list", "name", "record", "data", …
Additional Information
Study 3: Case Study

System: Open Conference System
Version: 2.3.6, released May 28th, 2014
Language: PHP
Supported DBMSs: MySQL, PostgreSQL
Architecture: Web-based application
Number of PHP files: 1557
Number of lines in PHP files: 22198
Number of application-defined roles: 7
Number of database tables: 52
Number of fields in database tables: 369
Additional Information
Study 3: Case Study

Number of sentences: 708
Number of ACR sentences: 327
Number of ACRs: 630
Number of DDE sentences: 329
Number of DDEs: 1002
Number of Entity DDEs: 748 (287 unique)
Number of Entity-Attribute DDEs: 99 (75 unique)
Number of Relationship DDEs: 155 (82 unique)
Number of DDE sentences with no ACRs: 2
Study 3: Extracting ACRs

Document | Precision | Recall | F1
OCS | 0.53 | 0.27 | 0.35

Top 10 ACR extraction errors:

Number of Times Missed | Error Type | Pattern
89 | FN | ( % VB root ( % NN dobj ))
36 | FN | ( % VB root ( % PRP nsubj )( % NN dobj ))
20 | FN | ( % VB root ( % NN prep_% ))
18 | FN | ( % VB root ( % NN nsubj )( % NN dobj ))
17 | FP | ( % VB root ( % NN nsubjpass ))
12 | FN | ( % VB root ( % PRP nsubj )( % NN prep_% ))
8 | FP | ( % VB root ( % PRP nsubj )( % NN dobj ))
5 | FN | ( allow VB root ( % PRP dobj )( % VB dep ( % NN dobj )))
5 | FN | ( % VB root ( % NN nsubj )( % NN prep_% ))
5 | FN | ( % VB root ( % NN nsubjpass ))
Additional Information
Sentence Similarity Algorithm

Modified version of the Levenshtein string edit distance: uses words (vertices) instead of characters.

computeVertexDistance(Vertex a, Vertex b)
 1: if a = NULL or b = NULL return 1
 2: if a.partOfSpeech <> b.partOfSpeech return 1
 3: if a.parentCount <> b.parentCount return 1
 4: for each parent in a.parents
 5:     if not b.parents.contains(parent) return 1
 6: if a.lemma = b.lemma return 0
 7: if a and b are numbers, return 0
 8: if NER classes match, return 0
 9: wnValue = wordNetSynonyms(a.lemma, b.lemma)
10: if wnValue > 0 return wnValue
11: return 1
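A sketch of the word-level edit distance itself; the per-vertex cost is stubbed to exact match (0 or 1), whereas computeVertexDistance above refines it with part of speech, parents, NER classes, and WordNet synonyms.

```java
// Sketch: Levenshtein edit distance over word vertices instead of characters.
public class SentenceDistanceSketch {
    // Stubbed substitution cost; the real algorithm uses computeVertexDistance
    static double vertexDistance(String a, String b) {
        return a.equalsIgnoreCase(b) ? 0.0 : 1.0;
    }

    static double distance(String[] s1, String[] s2) {
        double[][] d = new double[s1.length + 1][s2.length + 1];
        for (int i = 0; i <= s1.length; i++) d[i][0] = i;   // deletions
        for (int j = 0; j <= s2.length; j++) d[0][j] = j;   // insertions
        for (int i = 1; i <= s1.length; i++) {
            for (int j = 1; j <= s2.length; j++) {
                double sub = d[i - 1][j - 1] + vertexDistance(s1[i - 1], s2[j - 1]);
                d[i][j] = Math.min(sub, Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1));
            }
        }
        return d[s1.length][s2.length];
    }

    public static void main(String[] args) {
        String[] a = "A nurse can order a lab procedure".split(" ");
        String[] b = "A doctor can order a lab test".split(" ");
        System.out.println(distance(a, b));   // 2.0: nurse->doctor, procedure->test
    }
}
```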
Why is This Problem Difficult?

• Ambiguity
• Multiple ways to express the same meaning
• Resolution issues
Motivation: Healthcare Documentation

• HIPAA
• HITECH Act
• Meaningful Use Stage 1 Criteria
• Meaningful Use Stage 2 Criteria
• Certified EHR (45 CFR Part 170)
• ASTM
• HL7
• NIST FIPS PUB 140-2
• HIPAA Omnibus
• NIST Testing Guidelines
• DEA Electronic Prescriptions for Controlled Substances (EPCS)
• Industry Guidelines: CCHIT, EHRA, HL7
• State-specific requirements: North Carolina General Statute § 130A-480 – Emergency Departments
• Organizational policies and procedures
• Project requirements, use cases, design, test scripts, …
• Payment Card Industry: Data Security Standard

[Image: The Scream, Edvard Munch, 1895]
Dissertation Thesis

• Access control rules explicitly and implicitly defined within unconstrained natural language product artifacts can be effectively identified and extracted;
• Database design elements can be effectively identified and extracted;
• Mappings can be identified among the access control rules, database design elements, and the physical database implementation; and
• Role-based access control can be established within a system's relational database.