Mining Attribute Lifecycle to Predict Faults
and Incompleteness in Database Applications
Presented by: Sandra Alex (Roll No: 40)
Page 2
Outline
• Introduction
• Attribute Lifecycle Characterization
• Proposed Approach
• Experiment
• Prediction
• Related Work
• Conclusion
• References
Page 3
Introduction
Each attribute value is created initially via an insertion.
It may then be referenced, updated or deleted.
These occurrences of events are associated with the states of the attribute lifecycle.
The lifecycle describes the behaviour of an attribute value from its insertion to its final deletion.
We extract the attribute lifecycle from a database application.
Page 4
Our empirical studies show that faults and incompleteness in database applications are highly associated with the attribute lifecycle.
The learned prediction model can be applied in the development and maintenance of database applications.
Experiments were conducted on PHP systems.
Page 5
Attribute Lifecycle Characterization
For each attribute, a value is:
i. created -> insertion
ii. referenced -> selection
iii. updated -> updating
iv. deleted -> deletion
These occurrences of events are associated with states, which constitute the attribute lifecycle.
Page 7
Programs sustain the attribute lifecycle through four database operations: INSERT, SELECT, UPDATE and DELETE.
We formulate the following characteristics to characterize an attribute's lifecycle:
i. Create (C) -> the value of the attribute is inserted.
ii. Null Create (NC) -> the attribute is inserted without a value.
iii. Control Update (COU) -> the new value is not influenced by the existing attribute value or by inputs from the user and the database.
Page 8
iv. Overriding Update (OVU) -> the new value is not influenced by the existing value.
v. Cumulating Update (CMU) -> the new value is influenced by the existing value.
vi. Delete (D) -> the attribute is deleted as a result of the deletion of the record.
vii. Use (U) -> the value is used to support the insertion, updating or deletion of other database attributes, or is output to the external environment.
Page 9
Hence, we characterize the attribute lifecycle by a seven-element vector [m1, m2, m3, m4, m5, m6, m7], where each mi denotes whether a database operation of type C, NC, COU, OVU, CMU, D or U, respectively, is performed on the attribute.
Page 11
Proposed Approach
B. Extracting Attribute Lifecycle Characterization Data
1) Query Extraction

<?php
function exec_query($q) {
    return mysql_query($q);
}

$query = "SELECT username FROM users WHERE ";
if (isset($_POST['usertype'])) {
    $query .= "usertype = " . $_POST['usertype']; // use usertype
} else {
    $query .= "userid = " . $_POST['userid'];     // use userid
}
exec_query($query);
?>
The query string can differ at runtime.
Page 13
The analysis generates a set of basis paths.
When a query execution function such as "mysql_query" is encountered, the definition of every variable used is retrieved:
o literals are replaced by their actual values
o variables whose values are not statically known are replaced by placeholders
o the parts of the query string, with values substituted, are concatenated
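The reconstruction step above can be sketched as follows. This is a minimal illustration, not the actual extraction tool: the function name and the fragment representation are assumptions.

```python
def reconstruct(parts, known_values):
    """Rebuild the query string along one basis path: fragments are either
    string literals or variable names; statically-known variables are
    substituted, unknown ones become '?' placeholders."""
    out = []
    for p in parts:
        if p.startswith("$"):                     # a PHP-style variable
            out.append(known_values.get(p, "?"))  # placeholder if unknown
        else:                                     # a literal fragment
            out.append(p)
    return "".join(out)
```

For the PHP example above, the two basis paths would yield one query ending in `usertype = ?` and one ending in `userid = ?`.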
Page 14
2) Analysis of Attribute Lifecycle
The extracted queries are analysed with an SQL grammar parser to obtain the attribute lifecycle patterns.
CREATE TABLE: parsed first to collect the schema of the table.
VIEW: records the mapping of attributes between the view and the base table.
Page 15
SELECT:
o the query is parsed, table aliases are restored to the actual table names, and the attributes are identified
o "*" -> the table schema is consulted to get all attribute names
o "count(*)" -> the individual attributes are not considered; the access is characterized as "Use"
Page 16
INSERT:
o the table name is identified first
o no column list -> all the attributes are inserted
o column list -> the listed attributes are extracted
o attributes that are "auto incremental" or have non-null default values -> treated as inserted by the query
o these attributes are characterized as "Create"
o attributes explicitly assigned to null -> marked as "Null Create"
Page 17
UPDATE:
o collect the attribute names
o identify the update pattern
o the attribute assignments in the SET clause are separated
o the value string is analysed to determine the update characteristic: COU, OVU or CMU
o attributes used in the WHERE clause -> marked as "Use"
Page 18
DELETE:
o identify the table name
o mark all the attributes as "Delete"
o mark the attributes in the WHERE clause as "Use"
For each query, the attribute names in it are put into a collection, which is used to create the attribute lifecycle vectors.
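The per-query rules above can be sketched as a toy classifier. Note this is only an illustration: the actual analysis parses the SQL grammar and inspects column lists and SET values rather than matching keywords.

```python
def characterize(query):
    """Toy mapping from a query's type to the lifecycle characteristics
    it contributes; the real analysis uses an SQL grammar parser."""
    q = query.strip().upper()
    if q.startswith("INSERT"):
        return ["C"]                 # "NC" instead for columns set to NULL
    if q.startswith("SELECT"):
        return ["U"]
    if q.startswith("UPDATE"):
        chars = ["COU"]              # or OVU / CMU after analysing SET values
        if " WHERE " in q:
            chars.append("U")        # WHERE-clause attributes are "Use"
        return chars
    if q.startswith("DELETE"):
        chars = ["D"]
        if " WHERE " in q:
            chars.append("U")
        return chars
    return []
```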
Page 19
3) Generation of Attribute Lifecycle Vectors
For example, if there is at least one "Create" characteristic for an attribute:
o the first element of the vector is set to 1
o otherwise it is 0
If there is no operation on an attribute, all elements are set to 0.
We generate vectors for all attributes in a database application.
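The vector generation rule above amounts to a membership test over the seven characteristics (a minimal sketch; the names are illustrative):

```python
CHARACTERISTICS = ["C", "NC", "COU", "OVU", "CMU", "D", "U"]

def lifecycle_vector(observed):
    """Build the seven-element vector [m1..m7]: mi = 1 if the attribute
    has at least one operation with the i-th characteristic, else 0.
    An attribute with no operations yields the all-zero vector."""
    return [1 if c in observed else 0 for c in CHARACTERISTICS]
```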
Page 20
A. Data Collection
We seed faults in open-source database applications to train our model.
We chose systems that meet the following criteria:
• source code -> publicly available
• application size -> considerable (in transaction number and attribute number)
• mature enough -> very few faults associated with the attribute lifecycle
Experiment
Page 21
"batavi", a web-based e-commerce system;
"webERP", an accounting & business management system;
"FrontAccounting", a professional web-based accounting system;
"OpenBusinessNetwork", an application designed for business;
"SchoolMate", a solution for school administrations.
Page 22
Attribute lifecycles have a number of common patterns; attributes that do not follow them tend to cause errors.
We seeded the following common errors:
1) Missing function: attributes are provided, but a function is not catered for during program design
2) Inconsistent design: correcting the result of a transaction that updates an attribute by "cumulating update" using an "overriding update"
3) Redundant function: new programs are added for different types of operations
4) No update: new attributes without any update functions
Page 23
B. Experimental Design
We use three classifiers to learn the prediction model.
1) C4.5 classifier
o a decision tree classification algorithm
o uses normalized information gain to split the data
o computes the information gain of each attribute A
Page 24
Info(D) is defined as:
Info(D) = - Σ_i p_i log2(p_i)
where p_i is the probability that an instance belongs to class i.
In the training process, each time the classifier chooses the attribute with the highest normalized information gain to split the data, until all attributes are used.
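The entropy and information-gain computations behind this splitting rule can be sketched with their standard definitions (not the paper's code; C4.5 additionally normalizes the gain by the split information):

```python
import math
from collections import Counter, defaultdict

def info(labels):
    """Info(D) = -sum_i p_i * log2(p_i) over the class distribution."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, j):
    """Gain(A) = Info(D) - sum_v |Dv|/|D| * Info(Dv), splitting on feature j."""
    n = len(labels)
    split = defaultdict(list)
    for x, y in zip(rows, labels):
        split[x[j]].append(y)
    remainder = sum(len(part) / n * info(part) for part in split.values())
    return info(labels) - remainder
```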
Page 25
2) Naïve Bayes classifier
o a generative probabilistic model based on Bayes' theorem:
P(Ci|x) = P(x|Ci) P(Ci) / P(x)
o assuming that the attributes are independent, we have:
P(x|Ci) = Π_i P(xi|Ci)
o for a categorical value, the probability P(xi|Ci) is the proportion of the instances in class Ci which have attribute value xi.
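A categorical naive Bayes over binary lifecycle vectors can be sketched as follows. This is an illustration under the stated independence assumption, with Laplace smoothing added for empty counts; it is not the paper's implementation, and the class labels in the test are made up.

```python
from collections import Counter, defaultdict

def train_nb(X, y):
    """Count class priors and per-feature value frequencies."""
    classes = Counter(y)
    # counts[c][j][v]: number of class-c instances with value v at feature j
    counts = defaultdict(lambda: defaultdict(Counter))
    for xi, c in zip(X, y):
        for j, v in enumerate(xi):
            counts[c][j][v] += 1
    return classes, counts

def predict_nb(model, x):
    """Pick argmax_c P(c) * prod_j P(x_j | c)."""
    classes, counts = model
    n = sum(classes.values())
    best, best_p = None, -1.0
    for c, nc in classes.items():
        p = nc / n
        for j, v in enumerate(x):
            # Laplace smoothing; 2 possible values per binary feature
            p *= (counts[c][j][v] + 1) / (nc + 2)
        if p > best_p:
            best, best_p = c, p
    return best
```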
Page 26
3) SVM classifier
Support Vector Machine (SVM) is based on statistical learning theory; it trains the classification model by searching for the hyperplane that maximizes the margin between classes.
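The margin-maximization idea can be illustrated with a minimal linear SVM trained by sub-gradient descent on the regularized hinge loss. This only sketches the principle; a real experiment would use a mature SVM library, and the hyperparameters below are arbitrary.

```python
def train_linear_svm(X, y, epochs=200, lam=0.01, lr=0.1):
    """Minimal linear SVM via sub-gradient descent on the hinge loss;
    labels must be +1 / -1."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # inside the margin: hinge + regularization step
                w = [wj - lr * (lam * wj - yi * xj) for wj, xj in zip(w, xi)]
                b += lr * yi
            else:           # outside the margin: regularization step only
                w = [wj - lr * lam * wj for wj in w]
    return w, b

def predict_svm(w, b, x):
    """Classify by the side of the learned hyperplane."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1
```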
Page 27
C. Model Training
The attributes from the five systems were labelled to create the training set.
We manually checked and labelled each attribute as "missing function", "inconsistency design", "redundant function", "no update" or "normal".
Page 28
The model was trained with the three classifiers.
To evaluate the trained models, we performed 10-fold cross validation on the training set:
o the set was randomly partitioned into 10 folds
o each time, 9 folds were used as the training set and 1 fold as the testing set
o we computed the average measurements
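The 10-fold procedure above can be sketched generically (an illustration; `train_and_eval` is a hypothetical stand-in for whatever measurement is taken per round):

```python
import random

def ten_fold_cv(data, train_and_eval, seed=0):
    """Randomly partition the data into 10 folds; each round trains on
    9 folds and tests on the remaining one; return the average score."""
    items = list(data)
    random.Random(seed).shuffle(items)
    folds = [items[i::10] for i in range(10)]
    scores = []
    for k in range(10):
        test = folds[k]
        train = [x for j, f in enumerate(folds) if j != k for x in f]
        scores.append(train_and_eval(train, test))
    return sum(scores) / 10
```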
Page 29
D. Assessing Performance
o probability of detection: pd = tp / (tp + fn)
o probability of false alarm: pf = fp / (fp + tn)
o precision: pr = tp / (tp + fp)
An ideal model has pd close to 1 and pf close to 0.
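The three measurements follow directly from the confusion-matrix counts:

```python
def rates(tp, fn, fp, tn):
    """Compute pd, pf and pr from true/false positive/negative counts."""
    pd = tp / (tp + fn)   # probability of detection (recall)
    pf = fp / (fp + tn)   # probability of false alarm
    pr = tp / (tp + fp)   # precision
    return pd, pf, pr
```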
Page 31
Prediction
We applied the prediction model to four database applications to predict whether there are attributes with missing function, inconsistent design, redundant function or no update.
We applied the prediction model learned by SVM to these systems and counted the attributes that were predicted as faulty.
Page 32
Designers could then take corresponding actions to fix these design faults and incompleteness.
Further, we manually validated all the predicted attributes.
Of all 107 predicted attributes, 98 were confirmed to be real, giving a prediction precision of 91.59%.
Page 33
Conclusion
For each database attribute, we extract a set of characteristics from the code of the database application to characterize its lifecycle, forming a characterization vector.
Data mining techniques are applied to mine the attribute lifecycle using data collected from open-source database systems.
We seed errors into mature systems to simulate design faults and build the training dataset for our classification method.
Five types of labelled attributes are obtained.
Page 34
A fault and incompleteness prediction model is then built.
In our experiment, the model achieved 98.04% precision and 98.25% recall on average with SVM.
We also applied the model to four open-source database applications for prediction.
Future work: conduct more comprehensive experiments on a larger set of systems to further validate the merits of the proposed approach.
Page 35
References
[1] N. Nagappan and T. Ball, "Static Analysis Tools as Early Indicators of Pre-release Defect Density," in Proceedings of the 27th International Conference on Software Engineering. ACM, 2005, pp. 580–586.
[2] A. Nikora and J. Munson, "Developing Fault Predictors for Evolving Software Systems," in Proceedings of the Ninth International Software Metrics Symposium. IEEE, 2003, pp. 338–350.
[3] A. Watson, T. McCabe, and D. Wallace, "Structured Testing: A Testing Methodology Using the Cyclomatic Complexity Metric," NIST Special Publication, vol. 500, no. 235, pp. 1–114, 1996.
[4] W. Fan, M. Miller, S. Stolfo, W. Lee, and P. Chan, "Using Artificial Anomalies to Detect Unknown and Known Network Intrusions," Knowledge and Information Systems, vol. 6, no. 5, pp. 507–527, 2004.