Christine Preisach, Steffen Rendle and Lars Schmidt-Thieme
Information Systems and Machine Learning Lab (ISMLL)
University of Hildesheim
Germany
Relational Classification Using Automatically Extracted Relations by Record Linkage
2
Outline
• Motivation
• Relation Extraction and Multi-Relational Classification Framework
• Relation Extraction
• Multi-Relational Classification
• Evaluation
• Conclusion
3
Motivation
• Example:
[Figure: publications P1, P2 and P3]

Publication  Title                                      Author      Conference  Category
1            Classification of scientific publications  John Smith  ICDM        Data Mining
2            Classification of Hypertext                John Smith  KDD         Data Mining
3            Hierarchical Clustering                    Dan Miller  ICDM        Data Mining
4
Motivation
• Traditional classifiers take only local attributes such as keywords, title and abstract into account
• Assumption: instances are independent
• But: this assumption does not hold
– Instances can be related to other documents by authorship, citations, a shared conference, etc.
Therefore, these relations should be exploited and combined in order to improve classification accuracy.
• But: manual extraction of relations by experts is expensive
Therefore: automatic extraction of relations from noisy attributes.
5
Relation Extraction and Relational Classification Framework
• Relation Extraction Component
– Extraction of relations from objects with noisy attributes
• Multi-Relational Classification Component
– Use extracted relations instead of, or in addition to, local attributes for classification
[Figure: framework overview with objects x ∈ X, attributes a and relation R]

Example with noisy attributes:

Publication  Title                                      Author      Conference                                   Category
1            Classification of scientific publications  J. Smith    ICDM 2005                                    Data Mining
2            Classification of Hypertext                John Smith  KDD                                          Data Mining
3            Hierarchical Clustering                    Dan Miller  5th International Conference on Data Mining  Data Mining
6
Relation Extraction
• Pairwise feature extraction
– From noisy attributes with several similarity measures (e.g. TFIDF, cosine similarity, Levenshtein)
• Probabilistic pairwise decision model
– Use the extracted similarities as features for a probabilistic classifier and build a model on the training data
– Apply it to unknown pairs
• Collective decision model
– If $R$ is an equivalence relation, use constrained clustering (e.g. HAC) with the pairwise decision model as a learned similarity measure to transform its output into a binary relation
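[Figure: relation extraction pipeline, from attributes to relations]
Attributes: $a : X \to V$
Pairwise feature extraction: $f : X^2 \to \mathbb{R}^l$
Probabilistic pairwise decision model: $\hat{C} : \mathbb{R}^l \to [0,1]$, trained on pairs $(x, y)$ from the training set $X_{tr}$
Collective decision model
Relations: $R$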
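The first two stages can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slides do not name the probabilistic classifier, so a logistic model is used here, and the function names, the choice of two similarity measures and the weights are all hypothetical (in practice the weights would be learned from labeled training pairs).

```python
import math


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (dynamic programming, row by row)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def levenshtein_sim(a: str, b: str) -> float:
    """Edit distance rescaled to a similarity in [0, 1]."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def jaccard_sim(a: str, b: str) -> float:
    """Jaccard similarity of the lower-cased whitespace token sets."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def pair_features(x: dict, y: dict, attributes: list) -> list:
    """Feature vector for the pair (x, y): two similarities per attribute."""
    feats = []
    for attr in attributes:
        feats.append(levenshtein_sim(x[attr], y[attr]))
        feats.append(jaccard_sim(x[attr], y[attr]))
    return feats


def pair_probability(features: list, weights: list, bias: float) -> float:
    """Assumed logistic decision model mapping a feature vector to [0, 1]."""
    z = bias + sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


# Two records from the noisy example table.
p1 = {"author": "J. Smith", "conference": "ICDM 2005"}
p2 = {"author": "John Smith", "conference": "KDD"}
features = pair_features(p1, p2, ["author", "conference"])
# Hypothetical weights; a real model would fit them on training pairs.
score = pair_probability(features, weights=[2.0, 2.0, 2.0, 2.0], bias=-3.0)
```

The resulting score can then serve as the learned similarity that the collective decision model clusters on.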
7
Relation Extraction: Collective Decision Model
[Figure panels: Initialisation, Must Links, Cannot Links]
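A minimal sketch of constrained agglomerative clustering may help make the three stages concrete. It is not the authors' implementation: the function `constrained_hac`, the average-linkage choice, the stopping threshold and the toy similarity scores are assumptions made for illustration. In the framework, the `similarity` argument would be the output of the pairwise decision model.

```python
from itertools import combinations


def constrained_hac(items, similarity, must_links=(), cannot_links=(), threshold=0.5):
    """Average-linkage HAC that honours must-link and cannot-link constraints."""
    # Initialisation: every item starts in its own cluster.
    clusters = [{item} for item in items]

    def find(item):
        return next(c for c in clusters if item in c)

    def merge(c1, c2):
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(c1 | c2)

    def violates(c1, c2):
        # Merging is forbidden if it would join a cannot-link pair.
        return any((a in c1 and b in c2) or (a in c2 and b in c1)
                   for a, b in cannot_links)

    def avg_link(c1, c2):
        total = sum(similarity(a, b) for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    # Must links: merge these pairs up front.
    for a, b in must_links:
        ca, cb = find(a), find(b)
        if ca is not cb:
            merge(ca, cb)

    # Repeatedly merge the most similar pair of clusters that is still allowed.
    while True:
        candidates = [(avg_link(c1, c2), c1, c2)
                      for c1, c2 in combinations(clusters, 2)
                      if not violates(c1, c2)]
        if not candidates:
            break
        best, c1, c2 = max(candidates, key=lambda t: t[0])
        if best < threshold:
            break
        merge(c1, c2)
    return clusters


# Toy similarity scores (made up); unlisted pairs get a low default.
scores = {frozenset(("a", "b")): 0.9, frozenset(("c", "d")): 0.8}


def sim(x, y):
    return scores.get(frozenset((x, y)), 0.1)


free = constrained_hac(["a", "b", "c", "d"], sim)
constrained = constrained_hac(["a", "b", "c", "d"], sim, cannot_links=[("a", "b")])
```

Each resulting cluster is one equivalence class, so the binary relation consists of all pairs of items that share a cluster.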
8
Multi-Relational Classification
• Relational classification problem:
– Make use of additional information about related objects (i.e. their classes or attributes)
– Propositionalize the relational data, e.g. with:
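$$freq_x(c) := |\{x' \in N_x \mid c(x') = c\}|$$
where
$$N_x := \{x' \in X \mid (x, x') \in R,\ c(x') \neq .\}$$
is the neighborhood of $x$ (the related objects whose class is known).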
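The class-frequency aggregate can be illustrated with a short Python sketch. It assumes that the neighborhood contains only related objects with a known class, that the relation is stored as a set of ordered pairs, and that an unknown class is represented by `None`; the function names and the toy data are hypothetical.

```python
from collections import Counter


def neighborhood(x, relation, labels):
    """N_x: objects related to x whose class is known (not None)."""
    return {b for a, b in relation if a == x and labels.get(b) is not None}


def class_frequencies(x, relation, labels):
    """Number of neighbors of x per class."""
    return Counter(labels[n] for n in neighborhood(x, relation, labels))


# Toy data (made up): P4 is unlabeled and related to four other publications.
relation = {("P4", "P1"), ("P4", "P2"), ("P4", "P3"), ("P4", "P5")}
labels = {"P1": "Data Mining", "P2": "Data Mining", "P3": "Databases",
          "P4": None, "P5": None}

freq = class_frequencies("P4", relation, labels)
```

The counts in `freq` form an ordinary fixed-length attribute vector once they are laid out in a fixed class order, which is what makes the relational data usable by a propositional learner.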
9
Multi-Relational Classification
• Algorithm:
1. For each relation $R_1, \dots, R_m$:
(a) Build an undirected weighted graph $G = (X, E)$ with weights $w : X \times X \to \mathbb{R}$
(b) Perform relational classification simultaneously for all instances in the test set
(c) Output a probability distribution
2. Apply ensemble classification to the resulting probability distributions of these relations
3. Output the final classification
[Figure: one relational classification per relation, feeding into ensemble classification]
10
Multi-Relational Classification
• Simple Relational Methods
– Probabilistic Relational Neighbor Classifier (EPRN) [Macskassy and Provost 2003]
$$P^{(t)}(c \mid x) = \frac{1}{Z} \sum_{x' \in N_x} w(x, x')\, P^{(t-1)}(c \mid x')$$
where $Z$ is a normalization factor, $w$ is the weight and $t$ is the iteration
– EPRN2HOP
• Additionally takes the neighbors of the direct neighbors into account if the direct neighborhood size is small (threshold $d$)
$$P^{(t)}(c \mid x) = \frac{1}{Z} \Big( \sum_{x' \in N_x} w(x, x')\, P^{(t-1)}(c \mid x') + \sum_{x' \in N_x} \sum_{x'' \in N_{x'}} w(x', x'')\, P^{(t-1)}(c \mid x'') \Big)$$
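The one-hop update can be sketched as an iterative procedure in Python. This is a simplified illustration, not the authors' code: it assumes that labeled instances keep their distribution fixed, that all distributions share the same class keys, and that weights are stored as a nested dictionary; the names `prn_step` and `prn` are hypothetical.

```python
def prn_step(probs, weights, labeled):
    """One iteration: P_t(c|x) = 1/Z * sum over neighbors x' of w(x,x') * P_{t-1}(c|x')."""
    new_probs = {}
    for x, dist in probs.items():
        # Assumption: labeled instances and isolated instances stay unchanged.
        if x in labeled or not weights.get(x):
            new_probs[x] = dict(dist)
            continue
        scores = {c: 0.0 for c in dist}
        for neighbor, w in weights[x].items():
            for c, p in probs[neighbor].items():
                scores[c] += w * p
        z = sum(scores.values())  # normalization factor Z
        new_probs[x] = {c: s / z for c, s in scores.items()} if z > 0 else dict(dist)
    return new_probs


def prn(probs, weights, labeled, iterations=10):
    """Repeat the update for a fixed number of iterations."""
    for _ in range(iterations):
        probs = prn_step(probs, weights, labeled)
    return probs


# Toy graph (made up): u is unlabeled, a and b are labeled neighbors.
probs = {
    "a": {"DM": 1.0, "DB": 0.0},
    "b": {"DM": 0.0, "DB": 1.0},
    "u": {"DM": 0.5, "DB": 0.5},
}
weights = {"u": {"a": 2.0, "b": 1.0}}
result = prn(probs, weights, labeled={"a", "b"})
```

Because each step reads only the previous iteration's distributions, all unlabeled instances are updated simultaneously, matching step (b) of the algorithm.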
11
Multi-Relational Classification
• Aggregation-based Relational Learning Methods
– Use aggregation functions in order to propositionalize the set-valued attribute
– Use the aggregated values as attributes for traditional machine learning methods
– We used logistic regression as the classifier
[Figure labels: Category 1, Category 2, Category 3, Category 1]
12
Ensemble Classification
• Methods which combine different models
• Increases classification accuracy
• Usage
– Combine results achieved by relational classification for different relations
– Combine results of relational and local models
• Voting
$$P(c \mid x) = \frac{1}{L} \sum_{l=1}^{L} P_{M_l}(c \mid x)$$
• Stacking
– Use a meta-classifier to learn a model on the results of different models
– Build new instances
$$x_{new} = (P_{M_1}(c_1 \mid x), \dots, P_{M_1}(c_n \mid x), \dots, P_{M_L}(c_1 \mid x), \dots, P_{M_L}(c_n \mid x), c)$$
– Apply cross validation
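Both combination schemes are easy to illustrate in Python. The sketch below assumes each of the L models outputs a class distribution as a dictionary over the same classes; the function names `vote` and `stacking_instance` and the example numbers are hypothetical.

```python
def vote(distributions):
    """Voting: average the class distributions of the L models."""
    num_models = len(distributions)
    classes = distributions[0].keys()
    return {c: sum(d[c] for d in distributions) / num_models for c in classes}


def stacking_instance(distributions, classes, true_class):
    """Stacking: concatenate all model outputs, then append the true class."""
    features = [d[c] for d in distributions for c in classes]
    return features + [true_class]


# Made-up outputs of two models (e.g. two relations) for one instance.
author_model = {"A": 0.8, "B": 0.2}
journal_model = {"A": 0.4, "B": 0.6}

combined = vote([author_model, journal_model])
x_new = stacking_instance([author_model, journal_model], ["A", "B"], "A")
```

The stacked instances would then be used to train the meta-classifier, with cross validation ensuring that the base-model outputs come from instances those models were not trained on.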
13
Evaluation
• Data
– CompuScience data set
• 147 571 scientific papers
• 77 topics (categories)
• Relations: authors, reviewers, journals
– Cora deduplication data set
• 1 295 citations
• 112 unique publications
• Relation: samePaper
– Cora data set
• 3298 papers
• 12 categories
• Relations: conferences, authors, citations
14
Evaluation – Relation Extraction
Evaluation set  single linkage  complete linkage  average linkage
$X_{tst}$       0.90            0.74              0.92
$X$             0.92            0.71              0.93

F1 measure for finding the SamePaper relation on Cora
Pairwise feature extraction with TFIDF, Levenshtein, Jaccard, Cosine on all attributes
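The slides do not spell out how F1 is computed. One common reading, assumed here, is to score the relation as a set of pairs: every pair of objects placed in the same cluster counts as a predicted pair and is compared with the pairs of the gold clustering. The function names and the toy clusterings below are hypothetical.

```python
from itertools import combinations


def same_cluster_pairs(clusters):
    """All unordered pairs of items that share a cluster."""
    return {frozenset(p) for c in clusters for p in combinations(sorted(c), 2)}


def pairwise_f1(predicted, gold):
    """F1 of the predicted same-cluster pairs against the gold pairs."""
    pred, true = same_cluster_pairs(predicted), same_cluster_pairs(gold)
    true_positives = len(pred & true)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(true)
    return 2 * precision * recall / (precision + recall)


# Made-up clusterings of four citations.
gold = [{1, 2, 3}, {4}]
predicted = [{1, 2}, {3, 4}]
f1 = pairwise_f1(predicted, gold)
```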
15
Evaluation – Multi-Relational Classification
• The ensemble of relational and content-based text classification achieved a significantly higher F-measure than the pure text classifier
[Figure: 3-fold cross validation on CompuScience for the Author, Reviewer and Journal relations]
16
Evaluation – Multi-Relational Classification using automatically extracted relations
• 50%/50% splits, 10 runs
[Figure: Author Relation. Accuracy (y-axis, 0.5 to 0.75) over runs 1 to 10 for the annotated relation and the learned relation]
17
Conclusion and Future Work
• Summary:
– Presented a framework for relation extraction and multi-relational classification
• Automatic relation extraction with record linkage
• Relational classification using each extracted relation for classification and fusing the results with ensemble methods
• Future Work
– Evaluate our framework on different data sets and relations
– Evaluate the relational classifiers' quality depending on the quality of the extracted relations
18
Questions ?
www.ismll.uni-hildesheim.de
Christine Preisach
Steffen Rendle
Lars Schmidt-Thieme
Thank you