Matching Conceptual Models Using Multivariate Analysis

Page 1: Matching Conceptual Models Using Multivariate Analysis

MATCHING CONCEPTUAL MODELS (PART OF THE ‘IBIOSEARCH’ PROJECT)

JUNE 9, 2008

Quantitative Methods

Ritu Khare

Order of the Presentation

Problem and Background

Research Questions

Initial Dataset

Overall Methodology

Representation of Dataset A

Criteria to compare two entities

Generation of dataset B

Multivariate Analysis of dataset B

Results

Case I

Case II

Case III

Case IV

Inferences

Future Work

References

1. Problem and Background

Each search interface is represented as a conceptual model.

The aim is to combine all search interfaces, i.e., to combine several conceptual models.

Hence, the models must be matched.

This project focuses on matching entities.

[Figure: search interface X (fields A and B) and search interface Y (field C), each shown with its conceptual model]

2. Research Questions

Find one or more entity matching techniques to match the entities of two models.

Does this technique (or combination of techniques) provide a good way to compare two entities?

What other bases of comparison can be used?

3. Initial Dataset A

20 Conceptual Models

Example 1 (figure labels): BLASTP, DB, Alignments, Domain, Matrix, Expect

Example 2 (figure labels): Gene, Patent, Sequence, Title, Patent Accession No., Sequence Number, Gene Name, Gene ID

4. Overall Methodology

Input: Conceptual Models

Representation of Dataset A in structured tables

Criteria to compare entities from different models (Entity Name, Attribute Set, Relationship Set)

Generation of Dataset B

Multivariate Analysis of Dataset B

Output: Analysis Results

4.1 Representation of dataset A

Every model is represented as

List of entities

Every Entity in a model is represented as

Entity Name

List of attributes

List of relationships

Dataset A has the following columns:

(Model_ID, Entity_name, Attribute_set, Relationship_set)
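One way to hold this structure in code is sketched below. It is only an illustration of the four columns: the model IDs, the sample rows, and the choice to store a relationship as the name of the related entity are assumptions, not details given in the slides.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entity:
    """One row of Dataset A: an entity plus the sets that describe it."""
    model_id: str
    entity_name: str
    attribute_set: frozenset = field(default_factory=frozenset)
    # Assumption: a relationship is stored as the name of the related entity.
    relationship_set: frozenset = field(default_factory=frozenset)


# Hypothetical rows, invented for illustration only.
dataset_a = [
    Entity("M1", "Gene", frozenset({"Gene Name", "Gene ID"}), frozenset({"Sequence"})),
    Entity("M1", "Patent", frozenset({"Title", "Patent Accession No."}), frozenset({"Sequence"})),
    Entity("M2", "Sequence", frozenset({"Sequence Number"}), frozenset({"Gene", "Patent"})),
]


def entities_of(model_id, rows):
    """Return the list of entities that belong to one model."""
    return [row for row in rows if row.model_id == model_id]
```

Calling `entities_of("M1", dataset_a)` returns the two entities of the hypothetical model M1.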

4.2 Criteria to compare two entities

All entities from two different models are compared.

Criteria to compare two entities

Entity Name Similarity

Exact String Matching, Substring Matching

Output: Boolean Variable (True, False)

Attribute Set Similarity

Jaccard Coefficient

Output: Decimal Number (between 0 and 1)

Relationship Set Similarity

Jaccard Coefficient

Output: Decimal Number (between 0 and 1)
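The three criteria can be sketched with two small functions. This is a minimal illustration, not the project's code: the case-insensitive comparison and the value returned for two empty sets are assumptions.

```python
def name_similarity(name_a: str, name_b: str) -> bool:
    """True on an exact match or when one name is a substring of the other."""
    a, b = name_a.strip().lower(), name_b.strip().lower()  # assumption: ignore case
    if not a or not b:
        return False
    return a == b or a in b or b in a


def jaccard(set_a, set_b) -> float:
    """Jaccard coefficient |A & B| / |A | B|, a number between 0 and 1."""
    set_a, set_b = set(set_a), set(set_b)
    union = set_a | set_b
    if not union:
        return 0.0  # assumption: two empty sets count as no similarity
    return len(set_a & set_b) / len(union)
```

The same `jaccard` function serves both the attribute sets and the relationship sets; for example, `jaccard({"a", "b", "c"}, {"b", "c", "d"})` gives 0.5 because two of the four distinct items are shared.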

4.3 Generation of Dataset B

Input: 20 Conceptual Models

Algorithm:

Stem Entity Names and Attribute Names (Porter Stemmer)

Compare each pair of entities from different models based on the three criteria (Section 4.2)

Output: Table (598 records)

Pair#   Name Similarity   Attribute Similarity   Relationship Similarity
XYZ     Yes               0.657                  0.004
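The generation step can be sketched as a loop over cross-model entity pairs. Everything here is illustrative: `normalize` is a crude stand-in for the Porter stemmer (it is not the Porter algorithm), the tuple layout and sample entities are invented, and normalizing relationship labels as well as names is an assumption.

```python
from itertools import combinations


def normalize(term):
    """Stand-in for Porter stemming: lowercase and drop one trailing 's'."""
    term = term.strip().lower()
    return term[:-1] if term.endswith("s") and len(term) > 3 else term


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def name_match(a, b):
    return a == b or a in b or b in a


def generate_dataset_b(entities):
    """entities: list of (model_id, name, attributes, relationships) tuples."""
    rows = []
    for left, right in combinations(entities, 2):
        if left[0] == right[0]:
            continue  # only entities from different models are compared
        l_attrs = {normalize(x) for x in left[2]}
        r_attrs = {normalize(x) for x in right[2]}
        l_rels = {normalize(x) for x in left[3]}
        r_rels = {normalize(x) for x in right[3]}
        rows.append({
            "pair": (left[0] + ":" + left[1], right[0] + ":" + right[1]),
            "name_sim": name_match(normalize(left[1]), normalize(right[1])),
            "attr_sim": round(jaccard(l_attrs, r_attrs), 3),
            "rel_sim": round(jaccard(l_rels, r_rels), 3),
        })
    return rows


# Hypothetical input, invented for illustration only.
entities = [
    ("M1", "Genes", {"Gene Name", "Gene ID"}, {"Sequence"}),
    ("M1", "Patent", {"Title"}, {"Sequence"}),
    ("M2", "Gene", {"Gene ID", "Symbol"}, {"Sequences"}),
]
dataset_b = generate_dataset_b(entities)
```

With these three entities the sketch produces two records, because the two M1 entities are never compared with each other.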

4.4 Multivariate Analysis of dataset B

Manually annotate whether each pair represents similar entities (the “Match” column).

60 matches and 538 mismatches were found.

Is this a good Classification Model?

Can it correctly identify matching and non-matching pairs?

Which technique is suitable to answer these questions?

Binary Logistic Regression

The predictor variables are a combination of continuous and categorical variables.

Name_Sim (Categorical), Attr_Sim (Continuous), Rel_Sim (Continuous)

Pair#   Match   Name Sim.   Attribute Sim.   Relationship Sim.
XYZ     Yes     Yes         0.657            0.004
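For reference, a binary logistic regression on these three predictors takes the standard form below. The coefficient symbols are notation introduced here, and Name_Sim is assumed to be coded as 1 for Yes and 0 for No.

```latex
\[
\Pr(\mathrm{Match} = 1) = \frac{1}{1 + e^{-z}},
\qquad
z = \beta_0 + \beta_1\,\mathrm{Name\_Sim} + \beta_2\,\mathrm{Attr\_Sim} + \beta_3\,\mathrm{Rel\_Sim}
\]
```

A pair is classified as a match when the estimated probability exceeds a chosen cutoff, commonly 0.5.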

5. Results

Binary Logistic Regression

IV: Name_Sim, Attr_Sim, Rel_Sim

DV: Match

Case 1: IV = Name_Sim

Case 2: IV = Name_Sim, Attr_Sim

Case 3: IV = Name_Sim, Rel_Sim

Case 4: IV = Name_Sim, Attr_Sim, Rel_Sim
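The four cases differ only in which predictor columns enter the model. The sketch below fits each case with a small hand-written logistic regression (batch gradient ascent with a fixed number of steps). It is an assumption-laden illustration: the ten records are invented, and the slides do not say which software or fitting method was actually used.

```python
import math


def predict(weights, x):
    """Probability of a match for feature list x; weights[0] is the constant."""
    z = weights[0] + sum(w * v for w, v in zip(weights[1:], x))
    if z >= 0:  # numerically stable sigmoid
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(rows, labels, steps=2000, lr=0.5):
    """Fit [b0, b1, ...] by gradient ascent on the mean log-likelihood."""
    weights = [0.0] * (len(rows[0]) + 1)
    for _ in range(steps):
        grad = [0.0] * len(weights)
        for x, y in zip(rows, labels):
            err = y - predict(weights, x)
            grad[0] += err
            for j, value in enumerate(x):
                grad[j + 1] += err * value
        weights = [w + lr * g / len(rows) for w, g in zip(weights, grad)]
    return weights


# Hypothetical records: (name_sim, attr_sim, rel_sim, match). Not the real data.
records = [
    (1, 0.60, 0.10, 1), (1, 0.20, 0.00, 1), (1, 0.05, 0.30, 1), (1, 0.20, 0.00, 0),
    (0, 0.50, 0.20, 1), (0, 0.10, 0.00, 0), (0, 0.00, 0.25, 0), (0, 0.30, 0.05, 0),
    (0, 0.05, 0.00, 0), (0, 0.00, 0.00, 0),
]
labels = [r[3] for r in records]

# Column indices of the independent variables used in each case.
cases = {"Case 1": [0], "Case 2": [0, 1], "Case 3": [0, 2], "Case 4": [0, 1, 2]}

fitted = {}
for case, cols in cases.items():
    rows = [[r[c] for c in cols] for r in records]
    fitted[case] = fit_logistic(rows, labels)
```

Each entry of `fitted` holds the constant followed by one coefficient per independent variable, so `predict(fitted["Case 1"], [1])` estimates the match probability for a pair whose names are similar.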

5.1 Results: Case 1 and Case 2

Case 1: DV = Match, IV = Name_Sim

+ Accuracy increased from 85.6% to 92.6%, sensitivity increased from 0 to 59.3%, and the FN rate dropped from 100% to 40.7%.

+ The constant and Name_Sim are both significant in the equation.

+ Nagelkerke R square = .469

- Specificity decreased from 100% to 98.24%, and the FP rate increased from 0 to 1.75%.

- -2 Log Likelihood is very high (309.673).

- Cox and Snell R square = .263

Case 2: DV = Match, IV = Name_Sim, Attr_Sim

+ Accuracy increased from 85.6% to 92.6%, sensitivity increased from 0 to 59.3%, and the FN rate dropped from 100% to 40.7%.

+ The constant and Name_Sim are both significant in the equation.

+ Nagelkerke R square = .470

- Specificity decreased from 100% to 98.24%, and the FP rate increased from 0 to 1.75%.

- -2 Log Likelihood is very high (309.622).

- Cox and Snell R square = .264

- Attr_Sim is not significant in the equation.
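The rates quoted in these results all derive from a 2x2 classification table. The sketch below shows the arithmetic; the four counts are assumptions chosen only to give rates of a similar size, and they are not taken from the slides.

```python
def classification_rates(tp, fn, tn, fp):
    """Rates computed from a 2x2 classification table of counts."""
    positives = tp + fn  # pairs annotated as matches
    negatives = tn + fp  # pairs annotated as mismatches
    return {
        "accuracy": (tp + tn) / (positives + negatives),
        "sensitivity": tp / positives,  # true positive rate
        "specificity": tn / negatives,  # true negative rate
        "fn_rate": fn / positives,      # equals 1 - sensitivity
        "fp_rate": fp / negatives,      # equals 1 - specificity
    }


# Hypothetical counts, not reported in the source.
rates = classification_rates(tp=51, fn=35, tn=503, fp=9)
for name, value in rates.items():
    print(f"{name}: {value:.1%}")
```

A model that labels every pair as a mismatch has zero sensitivity and 100% specificity, which is why a gain in sensitivity is usually paid for with a small loss in specificity.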

5.2 Results: Case 3 and Case 4

Case 3: DV = Match, IV = Name_Sim, Rel_Sim

+ Accuracy increased from 85.6% to 92.6%, sensitivity increased from 0 to 59.3%, and the FN rate dropped from 100% to 40.7%.

+ The constant and Name_Sim are both significant in the equation.

+ Nagelkerke R square = .470

- Specificity decreased from 100% to 98.24%, and the FP rate increased from 0 to 1.75%.

- -2 Log Likelihood is very high (309.622).

- Cox and Snell R square = .264

- Rel_Sim is not significant in the equation.

Case 4: DV = Match, IV = Name_Sim, Attr_Sim, Rel_Sim

+ Accuracy increased from 85.6% to 92.6%, sensitivity increased from 0 to 59.3%, and the FN rate dropped from 100% to 40.7%.

+ The constant and Name_Sim are both significant in the equation.

+ Nagelkerke R square = .471

- Specificity decreased from 100% to 98.24%, and the FP rate increased from 0 to 1.75%.

- -2 Log Likelihood is very high (308.818).

- Cox and Snell R square = .265

- Attr_Sim and Rel_Sim are not significant in the equation.
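For reference, the two pseudo R square statistics quoted in these results are conventionally defined from -2 Log Likelihood values. In the notation assumed here, D_0 is the -2 Log Likelihood of the constant-only model, D_M is that of the fitted model, and n is the number of records. These are the standard textbook definitions; the slides do not state which formulas the software applied.

```latex
\[
R^2_{\mathrm{CS}} = 1 - \exp\!\left(-\frac{D_0 - D_M}{n}\right),
\qquad
R^2_{\mathrm{N}} = \frac{R^2_{\mathrm{CS}}}{1 - \exp\!\left(-\frac{D_0}{n}\right)}
\]
```

Because the Cox and Snell statistic cannot reach 1, the Nagelkerke statistic rescales it by its maximum attainable value, so the Nagelkerke figure is always the larger of the two.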

6. Inferences

Out of the three predictor variables (Name_Sim, Rel_Sim, and Attr_Sim), only Name_Sim is a good predictor of the actual class of an observation.

The misclassified cases are mainly observations that require domain knowledge, e.g., BLASTP is the same as Protein Sequence, and TBLASTX is the same as Nucleotide Sequence.

7. Future Work

Improve Similarity Function

Use of domain dictionaries

Include more models

Generate a new classification function

Clustering entities that are found similar
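As one possible shape for the domain dictionary idea, the sketch below maps tool-specific names to a canonical concept before the name comparison. The dictionary structure and function names are hypothetical; only the two example equivalences come from the inferences above.

```python
# Hypothetical domain dictionary: tool-specific term -> canonical concept.
DOMAIN_SYNONYMS = {
    "blastp": "protein sequence",
    "tblastx": "nucleotide sequence",
}


def canonical(name):
    """Replace a known domain term with its canonical concept."""
    key = name.strip().lower()
    return DOMAIN_SYNONYMS.get(key, key)


def name_similarity_with_dictionary(name_a, name_b):
    """Exact or substring match after dictionary lookup."""
    a, b = canonical(name_a), canonical(name_b)
    return a == b or a in b or b in a
```

With this lookup in place, `name_similarity_with_dictionary("BLASTP", "Protein Sequence")` returns True, whereas plain string matching would not pair the two names.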

References

NAR Journal dataset

Porter’s Stemming Algorithm: http://tartarus.org/~martin/PorterStemmer/

Sharma, S. (1995). Applied Multivariate Techniques. John Wiley & Sons, Inc., New York, NY, USA.

INFO 692 Lecture Handouts

Questions, Comments, Ideas…?

Thank You