Mining Attribute Lifecycle to Predict Faults and Incompleteness in Database Applications

Presented by: Sandra Alex, Roll no: 40

Page 2

Outline

INTRODUCTION
ATTRIBUTE LIFECYCLE CHARACTERIZATION
PROPOSED APPROACH
EXPERIMENT
PREDICTION
RELATED WORK
CONCLUSION
REFERENCES

Page 3

Introduction

Each attribute value is initially created via insertion.

It is then referenced, updated or deleted.

These event occurrences are associated with states that constitute the attribute lifecycle.

The lifecycle captures the behaviour of an attribute value from its insertion to its final deletion.

We extract the attribute lifecycle from the code of a database application.

Page 4

Introduction

Our empirical studies show that faults and incompleteness in database applications are highly associated with the attribute lifecycle.

The learned prediction model can be applied in the development and maintenance of database applications.

Experiments were conducted on PHP systems.

Page 5

Attribute Lifecycle Characterization

For each attribute, a value is
i. created -> insertion
ii. referenced -> selection
iii. updated -> updating
iv. deleted -> deletion

These event occurrences are associated with states to constitute the attribute lifecycle.

Page 6

Attribute Lifecycle Characterization

[Figure: state transition diagram of the attribute lifecycle]

Page 7

Attribute Lifecycle Characterization

Programs sustain the attribute lifecycle through four database operations: INSERT, SELECT, UPDATE and DELETE.

We formulate the following characteristics to describe the lifecycle:
i. Create (C) -> the value of the attribute is inserted.
ii. Null Create (NC) -> the attribute is inserted without a value.
iii. Control Update (COU) -> the new value is influenced neither by the existing attribute value nor by inputs from the user or database.

Page 8

Attribute Lifecycle Characterization

iv. Overriding Update (OVU) -> the new value is not influenced by the existing value.
v. Cumulating Update (CMU) -> the new value is influenced by the existing value.
vi. Delete (D) -> the attribute is deleted as a result of the deletion of the record.
vii. Use (U) -> the value is used to support the insertion, updating or deletion of other database attributes, or is output to the external environment.

Page 9

Attribute Lifecycle Characterization

Hence, we characterize the attribute lifecycle by a seven-element vector [m1, m2, m3, m4, m5, m6, m7], where m1, ..., m7 denote whether a database operation of type C, NC, COU, OVU, CMU, D and U, respectively, is performed on the attribute.
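As an illustration (a hypothetical attribute, not one from the studied systems): an attribute that is inserted with a value, read by queries and removed together with its record, but never updated, has the characteristics C, U and D, giving the vector [1, 0, 0, 0, 0, 1, 1].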

Page 10

Proposed Approach

A. Mining Attribute Lifecycle

Page 11

Proposed Approach

B. Extracting Attribute Lifecycle Characterization Data

1) Query Extraction

<?php
function exec_query($q) {
    return mysql_query($q);
}

$query = "SELECT username FROM users WHERE ";
if (isset($_POST['usertype'])) {
    $query .= "usertype = " . $_POST['usertype']; // use usertype
} else {
    $query .= "userid = " . $_POST['userid'];     // use userid
}
exec_query($query);
?>

The executed query can therefore differ at runtime, depending on which branch is taken.
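For this example, the two queries the analysis would extract (one per branch, writing <usertype> and <userid> as placeholders for the statically unknown POST values) are:

SELECT username FROM users WHERE usertype = <usertype>
SELECT username FROM users WHERE userid = <userid>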

Page 12

Proposed Approach

[Figure: control flow graph (CFG) for the code]

Page 13

Proposed Approach

The analysis generates a set of basis paths through the CFG.

When a query execution function such as "mysql_query" is encountered, the definition of every variable used is retrieved:
- literals -> replaced by their actual values
- variables whose values are not statically known -> replaced by placeholders
- the parts of the query string with replaced values -> concatenated (see the sketch below)
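A minimal sketch (ours, not the presented tool) of this concatenation step, assuming each query part along one basis path has been tagged as a literal or a dynamic value:

<?php
// Resolve a query string along one basis path: literals keep their
// values; statically unknown variables become placeholders.
function resolve_query(array $parts) {
    $resolved = '';
    foreach ($parts as $part) {
        if ($part['kind'] === 'literal') {
            $resolved .= $part['value'];
        } else { // dynamic value, e.g. a $_POST input
            $resolved .= '<' . $part['name'] . '>';
        }
    }
    return $resolved;
}

// The "usertype" branch of the earlier example:
$path = array(
    array('kind' => 'literal', 'value' => 'SELECT username FROM users WHERE '),
    array('kind' => 'literal', 'value' => 'usertype = '),
    array('kind' => 'dynamic', 'name' => 'usertype'),
);
echo resolve_query($path);
// prints: SELECT username FROM users WHERE usertype = <usertype>
?>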

Page 14

Proposed Approach

2) Analysis of Attribute Lifecycle

The extracted queries are analysed, using an SQL grammar parser, to obtain the attribute lifecycle patterns.

CREATE TABLE: parsed first, to collect the schema of the table.

VIEW: the mapping of attributes between the view and its base table is recorded.

Page 15

Proposed Approach

SELECT:
o the query is parsed, table aliases are restored to the actual table names, and the attributes are identified
o "*" -> the schema of the table is consulted to get all attribute names

o "count(*)" -> not considered, i.e. it characterizes no attribute as "Use"
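For instance (our example, not from the slides): for SELECT u.username FROM users u WHERE u.userid = 1, the alias u is restored to users, and username and userid are identified; since their values are read, both are characterized as "Use".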

Page 16

Proposed Approach

INSERT:
o the table name is identified first
o no column list -> all the attributes are inserted
o column list -> the listed attributes are extracted
o attributes that are "auto incremental" or have non-null default values -> also treated as inserted by the query
o these attributes are characterized as "Create"
o attributes explicitly assigned null -> marked as "Null Create"
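For example (hypothetical schema): for INSERT INTO users (username, email) VALUES ('bob', NULL), username is marked "Create", email is marked "Null Create", and an auto-incremental userid column would also be marked "Create".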

Page 17

UPDATE:
o collect the attribute names
o identify the update pattern
o the attribute assignments in the SET clause are separated
o the value string of each assignment is analysed to determine the update characteristic: either COU, OVU or CMU
o attributes used in the WHERE clause -> marked as "Use"
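For example (hypothetical query, with <status> standing for a user-supplied value): in UPDATE accounts SET balance = balance + 100, flag = 0, status = <status> WHERE id = 7, balance is a Cumulating Update (influenced by its existing value), flag is a Control Update (a program constant), status is an Overriding Update, and id is marked "Use".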

Page 18

DELETE:
o identify the table name
o mark all the attributes as "Delete"
o mark the attributes in the WHERE clause as "Use"
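For example (hypothetical query): DELETE FROM sessions WHERE userid = 3 marks every attribute of sessions as "Delete" and userid as "Use".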

For each query, the attribute names in it are put into a collection, from which the attribute lifecycle vectors are created.

Page 19

3) Generation of Attribute Lifecycle Vectors

For example,

if there is at least one "Create" characteristic for an attribute,
o the first element of its vector is set to 1,
o otherwise it is 0.

If no operation is performed on an attribute, all its elements are set to 0.

In this way we generate vectors for all attributes in a database application; a sketch of this step follows.
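A minimal sketch (ours, not the authors' code), assuming the fixed characteristic order [C, NC, COU, OVU, CMU, D, U] of the seven-element vector above:

<?php
// Build the seven-element lifecycle vector for one attribute from the
// set of characteristics collected for it across all extracted queries.
function lifecycle_vector(array $characteristics) {
    $order = array('C', 'NC', 'COU', 'OVU', 'CMU', 'D', 'U');
    $vector = array();
    foreach ($order as $c) {
        $vector[] = in_array($c, $characteristics, true) ? 1 : 0;
    }
    return $vector;
}

// The attribute from the earlier illustration: created, used, deleted.
print_r(lifecycle_vector(array('C', 'U', 'D')));
// elements: 1, 0, 0, 0, 0, 1, 1
?>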

Page 20

Experiment

A. Data Collection
We seed faults in open-source database applications to train our model.

The systems were chosen by three criteria:
• source code -> publicly available
• application size -> considerable (in transaction number and attribute number)
• maturity -> very few pre-existing faults associated with the attribute lifecycle

Page 21

Experiment

The five chosen systems:
• "batavi", a web-based e-commerce system;
• "webERP", an accounting & business management system;
• "FrontAccounting", a professional web-based system;
• "OpenBusinessNetwork", an application designed for business;
• "SchoolMate", a solution for school administrations.

Page 22

Experiment

Attribute lifecycles have a number of common patterns; attributes that do not follow them tend to cause errors.

We seeded the following common errors:
1) Missing function: attributes are provided, but the corresponding function is not catered for during the program design.
2) Inconsistency design: correcting the result of a transaction that updates an attribute by "cumulating update" using an "overriding update".
3) Redundant function: new programs written for different types of operations.
4) No Update: new attributes without any update functions.

Page 23

Experiment

B. Experimental Design
We use three classifiers to learn the prediction model.

1) C4.5 classifier
• a decision tree classification algorithm
• uses normalized information gain to split the data
• the information gain of an attribute A is Gain(A) = Info(D) - Info_A(D), where Info_A(D) = Σj (|Dj| / |D|) · Info(Dj) over the subsets Dj produced by splitting D on A; C4.5 normalizes Gain(A) by the split information of A

Page 24

Experiment

Info(D) is defined as:
Info(D) = - Σi pi · log2(pi)
where pi is the probability that an instance belongs to class i.

In the training process, the classifier repeatedly chooses the attribute with the highest normalized information gain to split the data, until all attributes are used.
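A worked example with hypothetical counts: for a two-class set of 14 instances split 9/5, Info(D) = -(9/14)·log2(9/14) - (5/14)·log2(5/14) ≈ 0.940 bits.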

Page 25

Experiment

2) Naïve Bayes classifier
• a generative probabilistic model, based on Bayes' theorem:
P(Ci | X) = P(X | Ci) · P(Ci) / P(X)
• assuming that the attributes are independent, we have:
P(X | Ci) = Πk P(xk | Ci)
• for a categorical value, the probability P(xi | Ci) is the proportion of the instances in class Ci which have attribute value xi.
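For instance (hypothetical counts): if 3 of the 10 training instances labelled "missing function" have m1 = 1, then P(m1 = 1 | missing function) = 3/10 = 0.3.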

Page 26

Experiment

3) SVM classifier
• Support Vector Machine (SVM) is based on statistical learning theory.
• It trains the classification model by searching for the hyperplane which maximizes the margin between the classes.
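In the standard linear formulation (not spelled out on the slide): minimize ||w||^2 / 2 subject to yi · (w · xi + b) >= 1 for all training points (xi, yi); the constraints keep the two classes on opposite sides of the hyperplane w · x + b = 0, while the objective maximizes the margin 2 / ||w||.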

Page 27

Experiment

C. Model Training
The attributes from the five systems were labelled to create the training set.

We manually checked and labelled each attribute as "missing function", "inconsistency design", "redundant function", "no update" or "normal".

Page 28

Experiment

The model was trained by the three classifiers.

To evaluate the trained models, we used 10-fold cross validation on the training set:
• the set was randomly partitioned into 10 folds
• each time, 9 of the folds were used as the training set
• and 1 fold was used as the testing set
• we computed the average measurements over the 10 runs.

Page 29

Experiment

D. Assessing Performance

• probability of detection: pd = tp / (tp + fn)
• probability of false alarm: pf = fp / (fp + tn)
• precision: pr = tp / (tp + fp)

An ideal model has pd close to 1 and pf close to 0.
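A quick check with hypothetical counts: for tp = 95, fn = 5, fp = 2, tn = 98, we get pd = 95/100 = 0.95, pf = 2/100 = 0.02 and pr = 95/97 ≈ 0.98.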

Page 30

For all three classifiers:
• pd > 87%
• pf < 1.81%

Ranking: SVM > C4.5 > Naïve Bayes.

For SVM:
• pd > 95%
• pf < 0.07%

Page 31

Prediction

We applied the prediction model learned by SVM to four database applications, to predict whether there are attributes with missing function, inconsistency design, redundant function or no update, and counted the attributes that were predicted to have these problems.

Page 32

Prediction

Designers could take corresponding actions to fix these design faults and incompleteness.

Further, we manually validated all the predicted attributes.

Of all the 107 predicted attributes, 98 were confirmed to be real faults or incompleteness,

giving a prediction precision of 91.59%.

Page 33

Conclusion

For each attribute, we extract from the code of a database application the set of characteristics that describe its lifecycle.

A characterization vector is then formed.

Data mining techniques are applied to mine the attribute lifecycle using the data collected from open-source database systems.

We seed errors in mature systems to simulate design faults and build the training dataset for our classification method.

Five types of labelled attributes are obtained.

Page 34

Conclusion

A fault and incompleteness prediction model is then built.

In our experiment, the model achieved 98.04% precision and 98.25% recall on average for SVM.

We also applied the model to four open-source database applications for prediction.

Future work: conduct more comprehensive experiments on a larger set of systems to further validate the merits of the proposed approach.

Page 35

References

[1] N. Nagappan and T. Ball, "Static Analysis Tools as Early Indicators of Pre-release Defect Density," in Proceedings of the 27th International Conference on Software Engineering. ACM, 2005, pp. 580–586.

[2] A. Nikora and J. Munson, "Developing Fault Predictors for Evolving Software Systems," in Proceedings of the Ninth International Software Metrics Symposium. IEEE, 2003, pp. 338–350.

[3] A. Watson, T. McCabe, and D. Wallace, "Structured Testing: A Testing Methodology Using the Cyclomatic Complexity Metric," NIST Special Publication, vol. 500, no. 235, pp. 1–114, 1996.

[4] W. Fan, M. Miller, S. Stolfo, W. Lee, and P. Chan, "Using Artificial Anomalies to Detect Unknown and Known Network Intrusions," Knowledge and Information Systems, vol. 6, no. 5, pp. 507–527, 2004.

Page 36

QUESTIONS

Page 37

Thank you