Matching Conceptual Models Using Multivariate Analysis

MATCHING CONCEPTUAL MODELS (PART OF THE ‘IBIOSEARCH’ PROJECT)

JUNE 9 2008

Quantitative Methods

Ritu Khare


Order of the Presentation

Problem and Background

Research Questions

Initial Dataset

Overall Methodology

Representation of Dataset A

Criteria to compare two entities

Generation of dataset B

Multivariate Analysis of dataset B

Results

Case I

Case II

Case III

Case IV

Inferences

Future Work

References


1. Problem and Background

A search interface is represented as a conceptual model.

The aim is to combine all search interfaces, i.e. to combine several conceptual models.

Hence, matching of models is required.

In this project, the focus is on matching entities.

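[Figure: two search interfaces and their conceptual models. Search X has fields A and B and maps to entity X with attributes A and B; Search Y has field C and maps to entity Y with attribute C.]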

2. Research Questions

Find an entity matching technique (or techniques) to match entities of two models.

Does this technique (or combination of techniques) provide a good way to compare two entities?

What other bases of comparison can be used?


3. Initial Dataset A

20 Conceptual Models

[Figure: Example 1 and Example 2, two conceptual model diagrams. Labels recoverable from the diagrams: BLASTP, DB, Alignments, Domain, Matrix, Expect; Gene, Patent, Sequence, Title, Patent Accession No., Sequence Number, Gene Name, Gene ID.]


4. Overall Methodology

Representation of Dataset A into structured tables

Criteria to compare entities from different models (Entity Name, Attribute Set, Relationship Set)

Generation of Dataset B

Multivariate Analysis of Dataset B

[Diagram: Conceptual Models → Analysis Results]

4.1 Representation of dataset A

Every model is represented as

List of entities

Every Entity in a model is represented as

Entity Name

List of attributes

List of relationships

Dataset A has the following columns:

(Model_ID, Entity_name, Attribute_set, Relationship_set)


4.2 Criteria to compare two entities

All entities from two different models are compared.

Criteria to compare two entities

Entity Name Similarity

Exact String Matching, Substring Matching

Output: Boolean Variable (True, False)

Attribute Set Similarity

Jaccard Coefficient

Output: Decimal Number (between 0 and 1)

Relationship Set Similarity

Jaccard Coefficient

Output: Decimal Number (between 0 and 1)
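The three criteria above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and two details are assumptions not stated on the slide, namely that a name pair counts as similar when either the exact or the substring test succeeds, and that two empty sets get a Jaccard score of 0.

```python
def name_similarity(name_a: str, name_b: str) -> bool:
    """Boolean name similarity: exact match or substring match.

    Assumption: comparison is case-insensitive and either test is enough.
    """
    a, b = name_a.strip().lower(), name_b.strip().lower()
    if not a or not b:
        return False
    return a == b or a in b or b in a


def jaccard(set_a, set_b) -> float:
    """Jaccard coefficient |A ∩ B| / |A ∪ B|, a number between 0 and 1."""
    a, b = set(set_a), set(set_b)
    union = a | b
    if not union:
        return 0.0  # assumption: two empty sets are treated as dissimilar
    return len(a & b) / len(union)
```

The same `jaccard` function would serve for both the attribute sets and the relationship sets, since both criteria use the same coefficient.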


4.3 Generation of Dataset B

Input: 20 Conceptual Models

Algorithm:

Stem Entity Names and Attribute Names (Porter Stemmer)

Compare each pair of entities from different models based on the three criteria (Section 4.2)

Output: Table (598 records)


Pair#   Name Similarity   Attribute Similarity   Relationship Similarity
XYZ     Yes               0.657                  0.004
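A minimal sketch of how Dataset B might be generated from Dataset A follows. The record layout mirrors the Dataset A columns (Model_ID, Entity_name, Attribute_set, Relationship_set), but the dictionary keys, the function names, and the sample records are hypothetical. The `stem` parameter stands in for a Porter stemmer (for instance NLTK's `PorterStemmer().stem`); the default here only lowercases, so that the example has no dependencies. Relationship sets are assumed to hold the names of related entities.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard coefficient; two empty sets score 0.0 (an assumption)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 0.0


def generate_dataset_b(dataset_a, stem=str.lower):
    """Compare every pair of entities that come from different models."""
    rows = []
    for e1, e2 in combinations(dataset_a, 2):
        if e1["model_id"] == e2["model_id"]:
            continue  # only entities from different models are compared
        n1, n2 = stem(e1["entity_name"]), stem(e2["entity_name"])
        attrs1 = {stem(a) for a in e1["attribute_set"]}
        attrs2 = {stem(a) for a in e2["attribute_set"]}
        rows.append({
            "pair": (e1["model_id"], e1["entity_name"],
                     e2["model_id"], e2["entity_name"]),
            "name_sim": n1 == n2 or n1 in n2 or n2 in n1,
            "attr_sim": jaccard(attrs1, attrs2),
            "rel_sim": jaccard(e1["relationship_set"], e2["relationship_set"]),
        })
    return rows


# Hypothetical records, for illustration only.
dataset_a = [
    {"model_id": 1, "entity_name": "Gene",
     "attribute_set": {"Gene Name", "Gene ID"}, "relationship_set": {"Sequence"}},
    {"model_id": 1, "entity_name": "Sequence",
     "attribute_set": {"Sequence Number"}, "relationship_set": {"Gene"}},
    {"model_id": 2, "entity_name": "gene",
     "attribute_set": {"Gene ID", "Title"}, "relationship_set": {"Patent"}},
]
rows = generate_dataset_b(dataset_a)
```

With these three sample records, the two entities of model 1 are never compared with each other, so only two pairs are produced.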

4.4 Multivariate Analysis of dataset B

Manually annotate whether a pair represents similar entities or not ("Match" column).

60 matches and 538 mismatches were found.

Is this a good Classification Model?

Can it correctly identify matching and non-matching pairs?

Which technique is suitable to answer these questions?

Binary Logistic Regression

Predictor variables are a combination of continuous and categorical variables.

Name_Sim (Categorical), Attr_Sim (Continuous), Rel_Sim (Continuous)


Pair#   Match   Name Sim.   Attribute Sim.   Relationship Sim.
XYZ     Yes     Yes         0.657            0.004
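To make the choice of binary logistic regression concrete, here is a small dependency-free sketch that fits such a model on rows of (Name_Sim, Attr_Sim, Rel_Sim) with a 0/1 Match label. The slides do not say which software was used, so everything here is an assumption: the function names are hypothetical, the weights are fitted with plain gradient ascent on the log likelihood (statistical packages typically use Newton-type solvers instead), and the six training rows are invented for illustration.

```python
import math


def sigmoid(z):
    """Logistic function mapping a linear score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(w, x):
    """P(Match = 1 | x); w[0] is the constant, w[1:] the coefficients."""
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))


def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit coefficients by batch gradient ascent on the mean log likelihood."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = yi - predict_proba(w, xi)
            grad[0] += err
            for j, xj in enumerate(xi, start=1):
                grad[j] += err * xj
        w = [wj + lr * g / len(X) for wj, g in zip(w, grad)]
    return w


# Invented rows: (Name_Sim as 0/1, Attr_Sim, Rel_Sim) and the Match label.
X = [(1, 0.8, 0.5), (1, 0.6, 0.2), (0, 0.1, 0.0),
     (0, 0.3, 0.1), (1, 0.7, 0.4), (0, 0.2, 0.3)]
y = [1, 1, 0, 0, 1, 0]

w = fit_logistic(X, y)
predicted = [int(predict_proba(w, x) > 0.5) for x in X]
```

Name_Sim enters as a 0/1 indicator, which is the usual way to feed a two-level categorical predictor into a logistic regression alongside continuous ones.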

5. Results

Binary Logistic Regression

IV: Name_Sim, Attr_Sim, Rel_Sim

DV: Match

Case I: IV = Name_Sim

Case II: IV = Name_Sim, Attr_Sim

Case III: IV = Name_Sim, Rel_Sim

Case IV: IV = Name_Sim, Attr_Sim, Rel_Sim


5.1 Results: Case I and Case II

Case I: DV = Match, IV = Name_Sim

+ Accuracy increased from 85.6% to 92.6%; sensitivity increased from 0% to 59.3%; FN rate dropped from 100% to 40.7%.

+ Variables in the equation: the constant and Name_Sim are both significant.

+ Nagelkerke R square = .469

- Specificity decreased from 100% to 98.24%; FP rate increased from 0% to 1.75%.

- -2 Log Likelihood is very high (309.673).

- Cox and Snell R square = .263

Case II: DV = Match, IV = Name_Sim, Attr_Sim

- Specificity decreased from 100% to 98.24%; FP rate increased from 0% to 1.75%.

- Variables in the equation: Attr_Sim is not significant.
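The accuracy, sensitivity, specificity, FN rate, and FP rate quoted in these results all derive from the classification table of predicted versus annotated labels. The sketch below shows the standard definitions; the function name is hypothetical, and the counts in the example are invented, not taken from the study.

```python
def classification_metrics(tp, fn, fp, tn):
    """Rates from a 2x2 classification table.

    tp: matches predicted as matches       fn: matches predicted as mismatches
    fp: mismatches predicted as matches    tn: mismatches predicted as mismatches
    """
    positives, negatives = tp + fn, fp + tn
    total = positives + negatives
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "sensitivity": tp / positives if positives else 0.0,
        "specificity": tn / negatives if negatives else 0.0,
        "fn_rate": fn / positives if positives else 0.0,
        "fp_rate": fp / negatives if negatives else 0.0,
    }


# Invented counts, for illustration only.
metrics = classification_metrics(tp=30, fn=20, fp=5, tn=45)

# A model that labels every pair as a mismatch has tp = 0 and fp = 0.
baseline = classification_metrics(tp=0, fn=50, fp=0, tn=50)
```

A baseline that predicts "mismatch" for every pair has sensitivity 0 and specificity 1, which is presumably the starting point behind the "from 0" and "from 100" figures above. Sensitivity and FN rate always sum to 1, as do specificity and FP rate.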

5.2 Results: Case III and Case IV

Case III: DV = Match, IV = Name_Sim, Rel_Sim

- Variables in the equation: Rel_Sim is not significant.

Case IV: DV = Match, IV = Name_Sim, Attr_Sim, Rel_Sim

- Variables in the equation: Attr_Sim and Rel_Sim are not significant.
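The -2 log likelihood, Cox and Snell R square, and Nagelkerke R square reported for these regressions are linked by standard formulas, sketched below. The function names are hypothetical. The formulas need the -2 log likelihood of the constant-only model as well, which the slides do not report, so the numbers in the example are invented.

```python
import math


def cox_snell_r2(null_deviance, model_deviance, n):
    """Cox and Snell R square: 1 - exp((D_model - D_null) / n).

    Deviance D means -2 log likelihood; n is the number of observations.
    """
    return 1.0 - math.exp((model_deviance - null_deviance) / n)


def nagelkerke_r2(null_deviance, model_deviance, n):
    """Nagelkerke R square: Cox and Snell rescaled by its maximum value."""
    max_r2 = 1.0 - math.exp(-null_deviance / n)
    return cox_snell_r2(null_deviance, model_deviance, n) / max_r2


# Invented values, for illustration only.
cs = cox_snell_r2(null_deviance=120.0, model_deviance=80.0, n=100)
nk = nagelkerke_r2(null_deviance=120.0, model_deviance=80.0, n=100)
```

Because the Cox and Snell measure cannot reach 1, the Nagelkerke value is always at least as large, which matches the ordering of the two figures reported for Case I.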

6. Inferences

Out of the three predictor variables (Name_Sim, Rel_Sim, and Attr_Sim), only Name_Sim is a good predictor of the actual class of an observation.

The misclassified cases mainly represent observations that require some domain knowledge, e.g. BLASTP is the same as Protein Sequence, and TBLASTX is the same as Nucleotide Sequence.

7. Future Work

Improve Similarity Function

Use of domain dictionaries

Include more models

Generate a new classification function

Clustering entities that are found similar


References

NAR Journal dataset

Porter’s Stemming Algorithm: http://tartarus.org/~martin/PorterStemmer/

Sharma, S. (1995), Applied Multivariate Techniques, John Wiley & Sons, Inc., New York, NY, USA.

INFO 692 Lecture Handouts


Questions, Comments, Ideas…?

Thank You
