Christine Preisach, Steffen Rendle and Lars Schmidt-Thieme Information Systems and Machine Learning Lab (ISMLL) University of Hildesheim Germany Relational Classification Using Automatically Extracted Relations by Record Linkage

Christine Preisach, Steffen Rendle and Lars Schmidt-Thieme

Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

Relational Classification Using Automatically Extracted Relations by Record Linkage

Outline

• Motivation

• Relation Extraction and Multi-Relational Classification Framework

• Relation Extraction

• Multi-Relational Classification

• Evaluation

• Conclusion

Motivation

• Example:

Publication | Title                                     | Author     | Conference | Category
1           | Classification of scientific publications | John Smith | ICDM       | Data Mining
2           | Classification of Hypertext               | John Smith | KDD        | Data Mining
3           | Hierarchical Clustering                   | Dan Miller | ICDM       | Data Mining

[Figure: graph over the publications P1, P2 and P3]

Motivation

• Traditional classifiers take only local attributes such as keywords, title and abstract into account

• Assumption: Instances are independent

• But: The assumption does not hold

– Instances can be related to other documents by authorship, citations, the same conference etc.

⇒ These relations should be exploited and combined in order to improve classification accuracy.

• But: Manual extraction of relations by experts is expensive

⇒ Automatic extraction of relations from noisy attributes.

Relation Extraction and Relational Classification Framework

Publication | Title                                     | Author     | Conference                                  | Category
1           | Classification of scientific publications | J. Smith   | ICDM 2005                                   | Data Mining
2           | Classification of Hypertext               | John Smith | KDD                                         | Data Mining
3           | Hierarchical Clustering                   | Dan Miller | 5th International Conference on Data Mining | Data Mining

• Relation Extraction Component

– Extraction of relations from objects with noisy attributes

• Multi-Relational Classification Component

– Use extracted relations instead of, or in addition to, local attributes for classification

Relation Extraction

• Pairwise feature extraction $f : X^2 \to \mathbb{R}^l$

– from noisy attributes $a : X \to V$ with several similarity measures (e.g. TFIDF, cosine similarity, Levenshtein)

• Probabilistic pairwise decision model $\hat{C} : \mathbb{R}^l \to [0, 1]$

– Use the extracted similarities as features for a probabilistic classifier and build a model on the training data $X_{tr}$

– Then apply it to unknown pairs $(x, y)$

• Collective decision model

– If $R$ is an equivalence relation, use constrained clustering (e.g. HAC) with the pairwise decision model as a learned similarity measure to transform it into a binary relation $R$

[Figure: pipeline from Attributes through pairwise feature extraction, probabilistic pairwise decision model and collective decision model to Relations]
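
The pairwise feature extraction step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the record fields (`title`, `author`), the helper names, and the choice of a normalised Levenshtein similarity plus a token-level Jaccard similarity are all hypothetical. TFIDF and cosine features are left out to keep the sketch short.

```python
from typing import Dict, List


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (row-by-row dynamic programme)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def levenshtein_sim(a: str, b: str) -> float:
    """Levenshtein distance rescaled to a similarity in [0, 1]."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def jaccard_sim(a: str, b: str) -> float:
    """Jaccard similarity of the lower-cased whitespace token sets."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return 1.0 if not union else len(ta & tb) / len(union)


def pair_features(x: Dict[str, str], y: Dict[str, str],
                  attributes: List[str]) -> List[float]:
    """Feature vector f(x, y): two similarity values per attribute."""
    feats: List[float] = []
    for attr in attributes:
        feats.append(levenshtein_sim(x[attr], y[attr]))
        feats.append(jaccard_sim(x[attr], y[attr]))
    return feats


# Two records with noisy author names (toy data in the spirit of the slides).
p1 = {"title": "Classification of scientific publications", "author": "J. Smith"}
p2 = {"title": "Classification of Hypertext", "author": "John Smith"}
features = pair_features(p1, p2, ["title", "author"])
```

A vector such as `features` would then be the input of the probabilistic pairwise decision model; any probabilistic classifier that outputs a value in [0, 1] could play that role.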

Relation Extraction: Collective Decision Model

[Figure: three panels titled Initialisation, Must Links and Cannot Links]
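
One way to picture a collective decision model with must-link and cannot-link constraints is the greedy single-linkage clustering below. It is only a sketch under assumptions: the function name, the `threshold` parameter and the representation of the learned pairwise similarity as a dictionary are hypothetical, and the constraints are assumed to be mutually consistent. The slides do not specify the authors' exact procedure.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Set, Tuple

Pair = Tuple[int, int]


def constrained_single_linkage(
    n: int,
    sim: Dict[FrozenSet[int], float],
    must: List[Pair],
    cannot: List[Pair],
    threshold: float = 0.5,
) -> List[Set[int]]:
    """Greedy single-linkage HAC that respects must-link / cannot-link pairs."""
    clusters: List[Set[int]] = [{i} for i in range(n)]

    def find(i: int) -> Set[int]:
        return next(c for c in clusters if i in c)

    def violates(a: Set[int], b: Set[int]) -> bool:
        # A merge is forbidden if it would put a cannot-link pair together.
        return any((i in a and j in b) or (i in b and j in a) for i, j in cannot)

    def merge(a: Set[int], b: Set[int]) -> None:
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a | b)

    # Initialisation: must-link pairs are merged before similarities are used.
    for i, j in must:
        a, b = find(i), find(j)
        if a is not b:
            merge(a, b)

    while True:
        best, best_sim = None, threshold
        for a, b in combinations(clusters, 2):
            if violates(a, b):
                continue
            # Single linkage: similarity of the closest cross-cluster pair.
            link = max(sim.get(frozenset((i, j)), 0.0) for i in a for j in b)
            if link > best_sim:
                best, best_sim = (a, b), link
        if best is None:
            return clusters  # no admissible merge above the threshold is left
        merge(*best)


# Learned pairwise similarities for four records (made-up values).
sim = {frozenset((0, 1)): 0.9, frozenset((1, 2)): 0.8, frozenset((2, 3)): 0.7}
result = constrained_single_linkage(4, sim, must=[], cannot=[(0, 2)])
```

The returned clusters induce the binary relation: two records are related exactly when they end up in the same cluster, which makes the result an equivalence relation by construction.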

Multi-Relational Classification

• Relational classification problem:

– Make use of additional information of related objects (i.e. their classes or attributes)

– Propositionalize the relational data, e.g. with:

$$\mathrm{freq}_c(x) := |\{x' \in N_x \mid c(x') = c\}|$$

where

$$N_x := \{x' \in X \mid (x, x') \in R,\ c(x') \neq .\}$$

is the neighborhood of $x$
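
The class-frequency propositionalization can be sketched in a few lines. The sketch assumes that a relation is stored as a set of ordered pairs and that an unknown class is stored as `None`; the function names and the toy data are hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, List, Optional, Set, Tuple

Obj = Hashable
Relation = Set[Tuple[Obj, Obj]]


def neighborhood(x: Obj, relation: Relation,
                 labels: Dict[Obj, Optional[str]]) -> Set[Obj]:
    """N_x: objects x' with (x, x') in R whose class is known."""
    return {b for a, b in relation if a == x and labels.get(b) is not None}


def class_frequencies(x: Obj, relation: Relation,
                      labels: Dict[Obj, Optional[str]],
                      classes: List[str]) -> List[int]:
    """freq_c(x) for every class c, in the order given by `classes`."""
    counts = Counter(labels[n] for n in neighborhood(x, relation, labels))
    return [counts[c] for c in classes]


# Toy data: P1 is unlabeled and related to P2 and P3 (both directions stored).
relation = {("P1", "P2"), ("P2", "P1"), ("P1", "P3"), ("P3", "P1")}
labels = {"P1": None, "P2": "Data Mining", "P3": "Data Mining"}
classes = ["Data Mining", "Databases"]  # "Databases" is a made-up second class
features = class_frequencies("P1", relation, labels, classes)
```

The resulting count vector can be used as an ordinary attribute vector by any propositional learner.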

Multi-Relational Classification

• Algorithm:

1. For each relation $R_i$, $i = 1, \dots, m$:
(a) Build an undirected weighted graph $G = (X, E)$ with $w : X \times X \to \mathbb{R}$
(b) Perform relational classification simultaneously for all instances in the test set
(c) Output a probability distribution
2. Apply ensemble classification to the resulting probability distributions of these relations
3. Output final classification

[Figure: one Relational Classification block per relation, all feeding into Ensemble Classification]
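
The three steps of the algorithm can be sketched as a short pipeline. Everything specific here is an assumption made for illustration: each relation graph is a nested dictionary of edge weights, the per-relation classifier is a one-step weighted vote over labelled neighbors, and the ensemble step is a plain average. The slides leave the concrete choices open at this point.

```python
from typing import Dict, Hashable, List, Sequence

Node = Hashable
Graph = Dict[Node, Dict[Node, float]]  # node -> neighbor -> edge weight w
Dist = List[float]


def weighted_vote(graph: Graph, known: Dict[Node, int],
                  n_classes: int) -> Dict[Node, Dist]:
    """Stand-in relational classifier: weighted vote of labelled neighbors."""
    out: Dict[Node, Dist] = {}
    for x, neighbors in graph.items():
        scores = [0.0] * n_classes
        for nb, w in neighbors.items():
            if nb in known:
                scores[known[nb]] += w
        z = sum(scores)
        # Fall back to a uniform distribution if no neighbor is labelled.
        out[x] = [s / z for s in scores] if z else [1.0 / n_classes] * n_classes
    return out


def multi_relational_classify(graphs: Sequence[Graph], known: Dict[Node, int],
                              n_classes: int,
                              test_nodes: Sequence[Node]) -> Dict[Node, int]:
    uniform = [1.0 / n_classes] * n_classes
    # Step 1: one probability distribution per relation and test instance.
    per_relation = [weighted_vote(g, known, n_classes) for g in graphs]
    final: Dict[Node, int] = {}
    for x in test_nodes:
        dists = [p.get(x, uniform) for p in per_relation]
        # Step 2: ensemble classification, here a simple average.
        avg = [sum(col) / len(dists) for col in zip(*dists)]
        # Step 3: final classification is the most probable class.
        final[x] = max(range(n_classes), key=avg.__getitem__)
    return final


author_graph: Graph = {"P1": {"P2": 1.0}}
journal_graph: Graph = {"P1": {"P3": 1.0, "P4": 1.0}}
known = {"P2": 0, "P3": 1, "P4": 0}
prediction = multi_relational_classify([author_graph, journal_graph], known, 2, ["P1"])
```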

Multi-Relational Classification

• Simple Relational Methods

– Probabilistic Relational Neighbor Classifier (EPRN) [Macskassy and Provost 2003]

$$P^{(t)}(c \mid x) = \frac{1}{Z} \sum_{x' \in N_x} w(x, x') \, P^{(t-1)}(c \mid x')$$

where $Z$ is a normalization factor, $w$ is the weight and $t$ is the iteration

– EPRN2HOP

• Additionally takes the neighbors of the direct neighbors into account if the direct neighborhood size is small

$$P^{(t)}(c \mid x) = \frac{1}{Z} \Big( \sum_{x' \in N_x} w(x, x') \, P^{(t-1)}(c \mid x') + \sum_{x'' \in N_{x'},\ |N_x| < d} w(x', x'') \, P^{(t-1)}(c \mid x'') \Big)$$
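
The iterative update of the probabilistic relational neighbor classifier can be sketched as below. The sketch makes several assumptions that the slides do not state: labelled nodes are clamped to their known class, unlabelled nodes start from a uniform distribution, all nodes are updated synchronously, and $Z$ is taken as the sum of the unnormalised scores. The function name and graph representation are hypothetical, and the two-hop variant is not covered.

```python
from typing import Dict, Hashable, List

Node = Hashable
Dist = List[float]


def prn_classify(weights: Dict[Node, Dict[Node, float]],
                 known: Dict[Node, int],
                 n_classes: int,
                 iterations: int = 10) -> Dict[Node, Dist]:
    """Iterate P_t(c|x) = 1/Z * sum over x' in N_x of w(x, x') * P_{t-1}(c|x').

    `weights` maps every node to its neighbors and edge weights; every
    neighbor must itself be a key of `weights`.
    """
    probs: Dict[Node, Dist] = {}
    for x in weights:
        if x in known:
            # Labelled nodes keep a one-hot distribution throughout.
            probs[x] = [1.0 if c == known[x] else 0.0 for c in range(n_classes)]
        else:
            probs[x] = [1.0 / n_classes] * n_classes

    for _ in range(iterations):
        new_probs = dict(probs)
        for x in weights:
            if x in known:
                continue
            scores = [0.0] * n_classes
            for nb, w in weights[x].items():
                for c in range(n_classes):
                    scores[c] += w * probs[nb][c]  # uses iteration t-1 values
            z = sum(scores)  # normalization factor Z
            if z > 0:
                new_probs[x] = [s / z for s in scores]
        probs = new_probs
    return probs


# A and B are labelled; C and D are classified collectively.
weights = {
    "A": {"C": 2.0},
    "B": {"C": 1.0},
    "C": {"A": 2.0, "B": 1.0, "D": 1.0},
    "D": {"C": 1.0},
}
probs = prn_classify(weights, known={"A": 0, "B": 1}, n_classes=2, iterations=50)
```

In this toy graph the estimates for C and D approach the fixed point of the update, where both assign probability 2/3 to class 0.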

Multi-Relational Classification

• Aggregation-based Relational Learning Methods

– Use aggregation functions in order to propositionalize the set-valued attribute

– Use aggregated values as attributes for traditional machine learning methods

– We used Logistic Regression as classifier

[Figure: example neighborhood with class labels Category 1, Category 2, Category 3 and Category 1]

Ensemble Classification

• Methods which combine different models

• Increases classification accuracy

• Usage

– Combine results achieved by relational classification for different relations

– Combine results of relational and local models

• Voting

$$P(c \mid x) = \frac{1}{L} \sum_{l=1}^{L} P_{M_l}(c \mid x)$$

• Stacking

– Use a meta-classifier to learn a model on the results of different models

– Build new instances

$$x_{new} = \big(P_{M_1}(c_1 \mid x), \dots, P_{M_1}(c_n \mid x), \dots, P_{M_L}(c_1 \mid x), \dots, P_{M_L}(c_n \mid x),\ c\big)$$

– Apply cross validation
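
Both ensemble schemes operate on the class distributions produced by the individual models. The sketch below shows unweighted voting as an average over $L$ distributions and the construction of one stacking instance by concatenation. The function names and the example probabilities are made up for illustration, and the meta-classifier itself is not shown.

```python
from typing import List, Sequence, Union


def vote(distributions: Sequence[Sequence[float]]) -> List[float]:
    """Unweighted voting: element-wise mean of the L class distributions."""
    n_models = len(distributions)
    return [sum(col) / n_models for col in zip(*distributions)]


def stacking_instance(distributions: Sequence[Sequence[float]],
                      label: str) -> List[Union[float, str]]:
    """New instance for the meta-classifier: all L*n probabilities, then class c."""
    features: List[Union[float, str]] = [p for dist in distributions for p in dist]
    return features + [label]


# Distributions over three classes from three relations (made-up numbers).
author = [0.7, 0.2, 0.1]
reviewer = [0.4, 0.5, 0.1]
journal = [0.6, 0.3, 0.1]

combined = vote([author, reviewer, journal])
meta_row = stacking_instance([author, reviewer, journal], "Data Mining")
```

Instances such as `meta_row` would be collected over the training folds of a cross validation, so that the meta-classifier is trained on predictions for objects the base models did not see.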

Evaluation

• Data

– CompuScience data set

• 147 571 scientific papers

• 77 topics (categories)

• Relations: authors, reviewers, journals

– Cora deduplication data set

• 1 295 citations

• 112 unique publications

• Relation: SamePaper

– Cora data set

• 3298 papers

• 12 categories

• Relations: conferences, authors, citations

Evaluation – Relation Extraction

Evaluation set | single linkage | complete linkage | average linkage
$X_{tst}$      | 0.90           | 0.74             | 0.92
$X$            | 0.92           | 0.71             | 0.93

F1 measure for finding the SamePaper relation on Cora

Pairwise feature extraction with TFIDF, Levenshtein, Jaccard, Cosine on all attributes

Evaluation – Multi-Relational Classification

• The ensemble of relational and content-based text classification achieved a significantly higher F-measure than the pure text classifier

3-fold cross validation on CompuScience for the Author, Reviewer and Journal relations

Evaluation – Multi-Relational Classification using automatically extracted relations

• 50%/50% splits, 10 runs

[Figure: "Author Relation", accuracy (y-axis, 0.5 to 0.75) over runs 1 to 10 (x-axis) for the Annotated Relation and the Learned Relation]

Conclusion and Future Work

• Summary:

– Presented framework for relation extraction and multi-relational classification

• Automatic relation extraction with record linkage

• Relational classification using each extracted relation for classification and fusing the results with ensemble methods

• Future Work

– Evaluate our framework on different data sets and relations

– Evaluate the relational classifiers' quality depending on the quality of the extracted relations

Questions?

www.ismll.uni-hildesheim.de

Christine Preisach

Steffen Rendle

Lars Schmidt-Thieme

Thank you