Matching Conceptual Models Using Multivariate Analysis
MATCHING CONCEPTUAL MODELS (PART OF THE ‘IBIOSEARCH’ PROJECT)
JUNE 9 2008
Quantitative Methods Ritu Khare
1
Order of the Presentation
Problem and Background
Research Questions
Initial Dataset
Overall Methodology
Representation of Dataset A
Criteria to compare two entities
Generation of dataset B
Multivariate Analysis of dataset B
Results
Case I
Case II
Case III
Case IV
Inferences
Future Work
References
2
1. Problem and Background
Each search interface is represented as a conceptual model.
The aim is to combine all search interfaces, i.e., to combine several conceptual models.
Hence, the models must be matched against each other.
This project focuses on matching entities.
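[Figure: search interface "Search X" with fields A and B, shown with its conceptual model (X, A, B); search interface "Search Y" with field C, shown with its conceptual model (Y, C).]
3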
2. Research Questions
Find one or more entity matching techniques to match the entities of two models.
Does this technique (or combination of techniques) provide a good way to compare two entities?
What other bases of comparison can be used?
4
3. Initial Dataset A
20 Conceptual Models
Example 1 (diagram labels): BLASTP, DB, Alignments, Domain, Matrix, Expect
Example 2 (diagram labels): Gene, Patent, Sequence, Title, Patent Accession No., Sequence Number, Gene Name, Gene ID
5
4. Overall Methodology
Input: Conceptual Models
Representation of Dataset A as structured tables
Criteria to compare entities from different models (Entity Name, Attribute Set, Relationship Set)
Generation of Dataset B
Multivariate Analysis of Dataset B
Output: Analysis Results
6
4.1 Representation of dataset A
Every model is represented as
List of entities
Every Entity in a model is represented as
Entity Name
List of attributes
List of relationships
Dataset A has the following columns:
(Model_ID, Entity_name, Attribute_set, Relationship_set)
7
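The representation above can be sketched in Python as follows. This is a minimal illustration, not the project's actual code: the class name `Entity`, the helper `to_rows`, and the sample data (including which labels count as attributes or relationships) are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entity:
    """One entity of a conceptual model (hypothetical structure)."""
    name: str
    attributes: frozenset = field(default_factory=frozenset)
    relationships: frozenset = field(default_factory=frozenset)


def to_rows(models):
    """Flatten {model_id: [Entity, ...]} into Dataset A rows:
    (Model_ID, Entity_name, Attribute_set, Relationship_set)."""
    return [
        (model_id, e.name, e.attributes, e.relationships)
        for model_id, entities in models.items()
        for e in entities
    ]


# Illustrative data only; the attribute/relationship assignment is assumed.
models = {
    1: [Entity("BLASTP", frozenset({"DB", "Matrix", "Expect"}))],
    2: [Entity("Gene", frozenset({"Gene Name", "Gene ID"}),
               frozenset({"Sequence"}))],
}
rows = to_rows(models)
```

Storing attributes and relationships as sets keeps the later set-based similarity computation straightforward.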
4.2 Criteria to compare two entities
Every entity of one model is compared with every entity of another model.
Three criteria are used to compare two entities:
Entity Name Similarity
Exact String Matching, Substring Matching
Output: Boolean Variable (True, False)
Attribute Set Similarity
Jaccard Coefficient
Output: Decimal Number (between 0 and 1)
Relationship Set Similarity
Jaccard Coefficient
Output: Decimal Number (between 0 and 1)
8
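The three criteria can be sketched in Python as below. The Jaccard coefficient of two sets X and Y is |X ∩ Y| / |X ∪ Y|. The function names, the case-insensitive comparison, and the choice to return 0.0 when both sets are empty are assumptions of this sketch; the slides do not specify them.

```python
def name_similarity(a: str, b: str) -> bool:
    """True on an exact match or when one name is a substring of the other.

    Comparison is case-insensitive here, which is an assumption.
    """
    a, b = a.strip().lower(), b.strip().lower()
    if not a or not b:
        return False
    return a == b or a in b or b in a


def jaccard(x, y) -> float:
    """Jaccard coefficient |X & Y| / |X | Y|, a number between 0 and 1.

    Returns 0.0 when both sets are empty (a convention chosen for this sketch).
    """
    x, y = set(x), set(y)
    union = x | y
    return len(x & y) / len(union) if union else 0.0
```

The same `jaccard` function serves both attribute set similarity and relationship set similarity, since both compare two sets of names.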
4.3 Generation of Dataset B
Input: 20 Conceptual Models
Algorithm:
Stem Entity Names and Attribute Names (Porter Stemmer)
Compare each pair of entities from different models on the three criteria (Section 4.2)
Output: Table (598 records)
9
Pair#   Name Similarity   Attribute Similarity   Relationship Similarity
XYZ     Yes               0.657                  0.004
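The generation step might look like the following sketch. The function `generate_dataset_b`, the record keys, and the default `stem=str.lower` are assumptions for illustration; lowercasing is only a stand-in so the example runs without extra packages. A real Porter stemmer (for example the `stem` method of NLTK's `PorterStemmer`) could be passed in its place, possibly applied word by word for multi-word names.

```python
from itertools import combinations


def jaccard(x, y):
    """Jaccard coefficient; 0.0 for two empty sets (sketch convention)."""
    x, y = set(x), set(y)
    return len(x & y) / len(x | y) if (x | y) else 0.0


def generate_dataset_b(rows, stem=str.lower):
    """rows: (model_id, entity_name, attribute_set, relationship_set) tuples.

    Returns one record per pair of entities drawn from different models.
    Entity names and attribute names are stemmed before comparison.
    """
    records = []
    for (m1, n1, a1, r1), (m2, n2, a2, r2) in combinations(rows, 2):
        if m1 == m2:  # only compare entities from different models
            continue
        s1, s2 = stem(n1), stem(n2)
        records.append({
            "pair": (n1, n2),
            "name_sim": s1 == s2 or s1 in s2 or s2 in s1,
            "attr_sim": jaccard({stem(a) for a in a1}, {stem(a) for a in a2}),
            "rel_sim": jaccard(r1, r2),
        })
    return records
```

Each returned record corresponds to one row of Dataset B, with the pair identifier and the three similarity values.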
4.4 Multivariate Analysis of dataset B
Manually annotate whether each pair represents similar entities (the “Match” column).
60 matches and 538 mismatches were found.
Is this a good classification model?
Can it correctly identify matching and non-matching pairs?
Which technique is suitable to answer these questions?
Binary Logistic Regression
The predictor variables are a mix of continuous and categorical variables:
Name_Sim (Categorical), Attr_Sim (Continuous), Rel_Sim (Continuous)
10
Pair#   Match   Name Sim.   Attribute Sim.   Relationship Sim.
XYZ     Yes     Yes         0.657            0.004
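For reference, binary logistic regression with the three predictors named above models the probability of a match in the standard form shown here. The coefficient symbols are notation introduced for this note, and coding Name_Sim as 1 for Yes and 0 for No is an assumption.

```latex
\[
\Pr(\text{Match}=1 \mid \mathbf{x}) = \frac{1}{1 + e^{-z}},
\qquad
z = \beta_0 + \beta_1\,\text{Name\_Sim} + \beta_2\,\text{Attr\_Sim} + \beta_3\,\text{Rel\_Sim}
\]
```

A pair is classified as a match when the estimated probability exceeds a cut-off, conventionally 0.5.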
5. Results
Binary Logistic Regression
IV: Name_Sim, Attr_Sim, Rel_Sim
DV: Match
Case 1: IV = Name_Sim
Case 2: IV = Name_Sim, Attr_Sim
Case 3: IV = Name_Sim, Rel_Sim
Case 4: IV = Name_Sim, Attr_Sim, Rel_Sim
11
5.1 Results: Case 1 and Case 2
Case 1: DV = Match, IV = Name_Sim
+ Accuracy increased from 85.6% to 92.6%, sensitivity increased from 0 to 59.3%, FN rate dropped from 100% to 40.7%
+ Variables in the equation: the constant and Name_Sim are both significant.
+ Nagelkerke R square = .469
- Specificity decreased from 100% to 98.24%, FP rate increased from 0 to 1.75%
- -2 Log Likelihood is very high (309.673)
- Cox and Snell R square = .263
Case 2: DV = Match, IV = Name_Sim, Attr_Sim
+ Accuracy increased from 85.6% to 92.6%, sensitivity increased from 0 to 59.3%, FN rate dropped from 100% to 40.7%
+ Variables in the equation: the constant and Name_Sim are both significant.
+ Nagelkerke R square = .470
- Specificity decreased from 100% to 98.24%, FP rate increased from 0 to 1.75%
- -2 Log Likelihood is very high (309.622)
- Cox and Snell R square = .264
- Variables in the equation: Attr_Sim is not significant.
12
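The two pseudo R square statistics reported in these results are conventionally defined as below. Here $L_0$ and $L_1$ denote the likelihoods of the intercept-only and fitted models, $D = -2\ln L$ is the "-2 Log Likelihood" value, and $n$ is the number of records; these symbols are notation introduced for this note.

```latex
\[
R^2_{\mathrm{CS}} = 1 - \left(\frac{L_0}{L_1}\right)^{2/n}
                 = 1 - \exp\!\left(-\frac{D_0 - D_1}{n}\right),
\qquad
R^2_{\mathrm{N}} = \frac{R^2_{\mathrm{CS}}}{1 - L_0^{2/n}}
                = \frac{R^2_{\mathrm{CS}}}{1 - \exp\!\left(-D_0/n\right)}
\]
```

Cox and Snell's statistic cannot reach 1, so Nagelkerke's version rescales it by its maximum attainable value, which is why the Nagelkerke figure is always the larger of the two.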
5.2 Results: Case 3 and Case 4
Case 3: DV = Match, IV = Name_Sim, Rel_Sim
+ Accuracy increased from 85.6% to 92.6%, sensitivity increased from 0 to 59.3%, FN rate dropped from 100% to 40.7%
+ Variables in the equation: the constant and Name_Sim are both significant.
+ Nagelkerke R square = .470
- Specificity decreased from 100% to 98.24%, FP rate increased from 0 to 1.75%
- -2 Log Likelihood is very high (309.622)
- Cox and Snell R square = .264
- Variables in the equation: Rel_Sim is not significant.
Case 4: DV = Match, IV = Name_Sim, Attr_Sim, Rel_Sim
+ Accuracy increased from 85.6% to 92.6%, sensitivity increased from 0 to 59.3%, FN rate dropped from 100% to 40.7%
+ Variables in the equation: the constant and Name_Sim are both significant.
+ Nagelkerke R square = .471
- Specificity decreased from 100% to 98.24%, FP rate increased from 0 to 1.75%
- -2 Log Likelihood is very high (308.818)
- Cox and Snell R square = .265
- Variables in the equation: Attr_Sim and Rel_Sim are not significant.
13
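The classification measures quoted in the results (accuracy, sensitivity, specificity, FN rate, FP rate) all derive from a 2x2 classification table. The sketch below shows the standard definitions; the function name and the counts are hypothetical and are not the study's actual table.

```python
def classification_rates(tp, fn, fp, tn):
    """Standard rates from a 2x2 classification table.

    tp/fn: actual matches predicted as match / no match.
    fp/tn: actual mismatches predicted as match / no match.
    """
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),   # share of true matches found
        "specificity": tn / (tn + fp),   # share of true mismatches found
        "fn_rate": fn / (tp + fn),       # 1 - sensitivity
        "fp_rate": fp / (tn + fp),       # 1 - specificity
    }


# Hypothetical counts for illustration only.
rates = classification_rates(tp=6, fn=4, fp=1, tn=89)
```

A baseline that predicts "no match" for every pair has sensitivity 0 and specificity 100%, which is presumably the starting point of each "from ... to ..." comparison in the results.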
6. Inferences
Of the three predictor variables (Name_Sim, Rel_Sim, and Attr_Sim), only Name_Sim is a good predictor of the actual class of an observation.
The misclassified cases are mainly observations that require domain knowledge, e.g. BLASTP is the same as Protein Sequence, and TBLASTX is the same as Nucleotide Sequence.
14
7. Future Work
Improve the similarity function
Use domain dictionaries
Include more models
Generate a new classification function
Cluster entities that are found to be similar
15
References
NAR Journal dataset
Porter’s Stemming Algorithm: http://tartarus.org/~martin/PorterStemmer/
Sharma, S. (1995). Applied Multivariate Techniques. John Wiley & Sons, Inc., New York, NY, USA.
INFO 692 Lecture Handouts
16
Questions, Comments, Ideas…?
Thank You
17