Semi-Supervised Feature Selection for Graph Classification
Xiangnan Kong, Philip S. Yu
Department of Computer Science, University of Illinois at Chicago
KDD 2010
2
Graph Classification - why should we care?
Chemical Compounds, Program Flows, XML Docs
Conventional data mining and machine learning approaches assume data are represented as feature vectors, e.g. (x1, x2, …, xd) → y
In real applications, data are not directly represented as feature vectors but as graphs with complex structures, e.g. G(V, E, l) → y
3
Example: Graph Classification
Drug activity prediction problem
[Figure: training graphs labeled + or -, and a testing graph with unknown label (?)]
Given a set of chemical compounds labeled with activities
Predict the activities of testing molecules
Also applies to program flows and XML docs
4
Subgraph-based Graph Classification
[Figure: graph objects G1, G2 are checked against subgraph patterns g1, g2, g3, … and mapped to binary feature vectors (x1, x2, …), which are passed to a classifier]
Graph Objects → Feature Vectors → Classifiers
How to mine a set of subgraph patterns in order to effectively perform graph classification?
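The mapping from graph objects to feature vectors can be sketched as follows. This is a minimal illustration, not the authors' code: the graph encoding, the toy molecules `G1` and `G2`, the patterns `g1` to `g3`, and the helper names `contains` and `to_feature_vector` are all assumptions made for the example. The containment test is a brute-force search that is only practical for very small graphs.

```python
from itertools import permutations

# Assumed encoding: a graph is (labels, edges), where labels maps
# node id -> atom symbol and edges is a list of undirected (u, v) pairs.

def contains(graph, pattern):
    """Return True if `pattern` occurs in `graph` as a label-preserving subgraph."""
    g_labels, g_edges = graph
    p_labels, p_edges = pattern
    g_edge_set = {frozenset(e) for e in g_edges}
    p_nodes = list(p_labels)
    # Brute force: try every injective mapping of pattern nodes onto graph nodes.
    for image in permutations(g_labels, len(p_nodes)):
        m = dict(zip(p_nodes, image))
        labels_ok = all(p_labels[v] == g_labels[m[v]] for v in p_nodes)
        edges_ok = all(frozenset((m[u], m[v])) in g_edge_set for u, v in p_edges)
        if labels_ok and edges_ok:
            return True
    return False

def to_feature_vector(graph, patterns):
    """One binary feature per subgraph pattern: 1 if the graph contains it."""
    return [1 if contains(graph, p) else 0 for p in patterns]

# Hypothetical toy data (not taken from the slides).
G1 = ({0: "N", 1: "H", 2: "H"}, [(0, 1), (0, 2)])   # H-N-H
G2 = ({0: "C", 1: "C", 2: "O"}, [(0, 1), (1, 2)])   # C-C-O
g1 = ({0: "N", 1: "H"}, [(0, 1)])                   # N-H
g2 = ({0: "C", 1: "O"}, [(0, 1)])                   # C-O
g3 = ({0: "C", 1: "C"}, [(0, 1)])                   # C-C
patterns = [g1, g2, g3]

x1 = to_feature_vector(G1, patterns)
x2 = to_feature_vector(G2, patterns)
```

Any standard classifier can then be trained on vectors such as `x1` and `x2`; which patterns to use as features is the question the rest of the talk addresses.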
5
Conventional Methods - Two Components
1. Evaluation (effective): is a subgraph feature relevant to graph classification?
2. Search space pruning (efficient): how to avoid enumerating all subgraph features?
[Figure: labeled graphs (+ / -) → discriminative subgraphs]
6
One Problem
Supervised settings require a large set of labeled training graphs
However, labeling a graph is hard!
[Figure: labeled graphs (+ / -) → discriminative subgraphs]
7
Lack of labels -> problems
Supervised methods:
1. Evaluation effective? Requires a large amount of label information
2. Search space pruning efficient? Pruning performance relies on a large amount of label information
[Figure: labeled graphs (+ / -) → discriminative subgraphs]
8
Semi-Supervised Feature Selection for Graph Classification
[Figure: labeled graphs (+ / -) and unlabeled graphs (?) → evaluation criteria F(p) → subgraph patterns → classification]
Mine useful subgraph patterns using labeled and unlabeled graphs
9
Two Key Questions to Address
Evaluation: How to evaluate a set of subgraph features with both labeled and unlabeled graphs? (effective)
Search Space Pruning: How to prune the subgraph search space using both labeled and unlabeled graphs? (efficient)
10
What is a good feature?
Cannot-Link: graphs in different classes should be far away
Must-Link: graphs in the same class should be close
Separability: unlabeled graphs should be separable from each other
[Figure: labeled (+ / -) and unlabeled graphs illustrating the three criteria]
11
Optimization
Cannot-Link (+ -): graphs in different classes should be far away
Must-Link (+ +): graphs in the same class should be close
Separability (? ?): unlabeled graphs should be separable from each other
Evaluation function: [equation combining the three criteria]
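One way to read the three criteria as a single objective is sketched below. The exact formula from the slide is not shown in this transcript, so the form used here is an assumption: a weighted sum of squared distances in the selected-feature space, with weight α on cannot-link pairs, weight β (subtracted) on must-link pairs, and weight 1 on unlabeled pairs, without any normalization. The function names and the toy data are hypothetical; the default values α=1 and β=0.1 are the best setting reported on the Parameters slide.

```python
def sq_dist(x, y, selected):
    """Squared Euclidean distance restricted to the selected feature indices."""
    return sum((x[k] - y[k]) ** 2 for k in selected)

def evaluate(selected, labeled, unlabeled, alpha=1.0, beta=0.1):
    """Score a feature subset under the assumed objective (higher is better).

    labeled:   list of (feature_vector, label) pairs, label in {+1, -1}
    unlabeled: list of feature_vectors
    """
    cannot = must = sep = 0.0
    for i in range(len(labeled)):
        for j in range(i + 1, len(labeled)):
            (xi, yi), (xj, yj) = labeled[i], labeled[j]
            d = sq_dist(xi, xj, selected)
            if yi != yj:
                cannot += d   # different classes: reward large distance
            else:
                must += d     # same class: penalize large distance
    for i in range(len(unlabeled)):
        for j in range(i + 1, len(unlabeled)):
            sep += sq_dist(unlabeled[i], unlabeled[j], selected)  # separability
    return alpha * cannot - beta * must + sep

# Hypothetical toy data: feature 0 follows the class label, feature 1 does not.
labeled = [([1, 0], +1), ([1, 1], +1), ([0, 1], -1), ([0, 0], -1)]
unlabeled = [[1, 0], [0, 1], [1, 1]]

score_feature_0 = evaluate([0], labeled, unlabeled)
score_feature_1 = evaluate([1], labeled, unlabeled)
```

Under this assumed objective, the feature that agrees with the class labels receives the higher score. Setting α and β much larger than 1 makes the labeled pairs dominate, while setting both near 0 leaves only the unlabeled term.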
12
Evaluation: gSemi Criterion
gSemi score: [equation], in which one term represents the k-th subgraph feature
In matrix form: [equation] (the sum over all selected features)
[Figure: example subgraphs, NHH scored as good and CCCC scored as bad]
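Because the matrix form is a sum over the selected features, each subgraph feature can be scored on its own and the best ones kept. The sketch below illustrates that idea under stated assumptions: each feature k is represented by an indicator vector over all graphs, the pairwise weights are α for cannot-link pairs, -β for must-link pairs, 1 for unlabeled pairs and 0 otherwise, and the per-feature score is the quadratic form of that vector with the graph Laplacian L = D - W. These choices, the function names, and the indicator values are illustrative and are not taken from the slides.

```python
def build_weights(labels, alpha=1.0, beta=0.1):
    """labels[i] is +1, -1, or None (unlabeled). Returns a symmetric weight matrix."""
    n = len(labels)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            yi, yj = labels[i], labels[j]
            if yi is None and yj is None:
                W[i][j] = 1.0                             # separability
            elif yi is not None and yj is not None:
                W[i][j] = alpha if yi != yj else -beta    # cannot-link / must-link
    return W

def laplacian(W):
    """Graph Laplacian L = D - W, with D the diagonal matrix of row sums."""
    n = len(W)
    return [[(sum(W[i]) if i == j else 0.0) - W[i][j] for j in range(n)]
            for i in range(n)]

def feature_score(f, L):
    """Quadratic form f^T L f for one feature's indicator vector f."""
    n = len(f)
    return sum(f[i] * L[i][j] * f[j] for i in range(n) for j in range(n))

def select_top_k(features, L, k):
    """Keep the k features with the highest individual scores."""
    ranked = sorted(features, key=lambda name: feature_score(features[name], L),
                    reverse=True)
    return ranked[:k]

# Hypothetical data: four labeled graphs followed by three unlabeled graphs.
labels = [+1, +1, -1, -1, None, None, None]
features = {
    "NHH":  [1, 1, 0, 0, 1, 0, 1],   # present exactly in the positive graphs
    "CCCC": [0, 1, 1, 0, 0, 1, 1],   # unrelated to the labels
}
L = laplacian(build_weights(labels))
top = select_top_k(features, L, 1)
```

The quadratic form equals the weighted sum of squared differences over all pairs of graphs, which is why a per-feature score of this kind can stand in for a pairwise distance objective.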
13
Experiment Results
[Figure: results on the MCF-3, NCI-H23, OVCAR-8, PTC-MM, and PTC-FM datasets with #labeled graphs = 30, 50, and 70]
14
Experiment Results
[Figure: accuracy (higher is better) vs. # selected features; #labeled graphs = 30, MCF-3 dataset]
Methods compared: gSemi (Semi-Supervised), Information Gain (Supervised), Frequent (Unsupervised)
Our approach performed best on the NCI and PTC datasets
15
Two Key Questions to Address
How to evaluate a set of subgraph features with both labeled and unlabeled graphs? (effective)
How to prune the subgraph search space using both labeled and unlabeled graphs? (efficient)
16
Finding a Needle in a Haystack
[Figure: pattern search tree with levels for 0 edges (root), 1 edge, 2 edges, …; branches that are not frequent are cut off]
gSpan [Yan et al., ICDM'02]: an efficient algorithm to enumerate all frequent subgraph patterns (frequency ≥ min_support)
Too many frequent subgraph patterns
Find the most useful one(s)
How to find the best node(s) in this tree without searching all the nodes? (Branch and bound to prune the search space)
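The frequency threshold mentioned above can be illustrated with a short sketch. This is not gSpan itself, which also handles the enumeration of candidate patterns. It only shows the filtering rule, assuming frequency is counted as the absolute number of graphs that contain a pattern. The pattern names, indicator columns, and function names are hypothetical.

```python
def support(column):
    """Number of graphs that contain the pattern (column of 0/1 indicators)."""
    return sum(column)

def frequent_patterns(columns, min_support):
    """Keep the patterns whose support reaches the threshold."""
    return [name for name, col in columns.items() if support(col) >= min_support]

# Hypothetical indicator columns over four graphs. A larger pattern can never
# be more frequent than a pattern it contains, so supports shrink down the tree.
columns = {
    "C-C":     [1, 1, 1, 0],
    "C-C-C":   [1, 1, 0, 0],
    "C-C-C-C": [1, 0, 0, 0],
}
kept = frequent_patterns(columns, min_support=2)
```

Because support can only decrease as a pattern grows, a branch of the search tree can be abandoned as soon as its pattern falls below `min_support`.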
17
Pruning Principle
[Figure: pattern search tree showing the best subgraph so far with its best score so far, the current node with its current score, and an upper bound on the gSemi score within the current node's sub-tree]
If best score ≥ upper bound, we can prune the entire sub-tree
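The pruning rule above can be sketched as a depth-first branch-and-bound search. This is a generic illustration under assumptions: the tree, the scores, and the upper bounds are made-up numbers, and the `Node` class and `branch_and_bound` function are hypothetical names. How the upper bound on the gSemi score is actually derived is not covered by this transcript, so the sketch simply takes it as given for each node.

```python
class Node:
    def __init__(self, pattern, score, upper_bound, children=()):
        self.pattern = pattern
        self.score = score
        # Assumed to be no smaller than the score of any descendant of this node.
        self.upper_bound = upper_bound
        self.children = list(children)

def branch_and_bound(root):
    """Return (best pattern, best score, number of nodes explored)."""
    best = {"pattern": None, "score": float("-inf")}
    explored = 0

    def visit(node):
        nonlocal explored
        explored += 1
        if node.score > best["score"]:
            best["pattern"], best["score"] = node.pattern, node.score
        # Prune: nothing below this node can beat the best score so far.
        if best["score"] >= node.upper_bound:
            return
        for child in node.children:
            visit(child)

    visit(root)
    return best["pattern"], best["score"], explored

# Hypothetical search tree with seven nodes.
tree = Node("C", 1.0, 10.0, [
    Node("N-H", 6.0, 7.0, [Node("H-N-H", 6.5, 6.5)]),
    Node("C-C", 2.0, 3.0, [Node("C-C-C", 2.5, 2.5)]),
    Node("C-O", 3.0, 5.0, [Node("C-C-O", 4.0, 4.0)]),
])
result = branch_and_bound(tree)
```

In this toy tree the search visits five of the seven nodes: once `H-N-H` is found, the sub-trees under `C-C` and `C-O` are skipped because their upper bounds do not exceed the best score so far.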
18
Pruning Results
[Figure: running time in seconds (lower is better), with gSemi pruning vs. without gSemi pruning; MCF-3 dataset]
19
Pruning Results
[Figure: # subgraphs explored (lower is better), with gSemi pruning vs. without gSemi pruning; MCF-3 dataset]
20
Parameters
[Figure: accuracy (higher is better) as a function of α (cannot-link constraints) and β (must-link constraints); MCF-3 dataset, #label=50, #feature=20]
Best: α=1, β=0.1 (semi-supervised)
Other regions of the plot are marked "close to supervised" and "close to unsupervised"
21
Conclusions
Semi-Supervised Feature Selection for Graph Classification
Evaluating subgraph features using both labeled and unlabeled graphs (effective)
Branch-and-bound pruning of the search space using labeled and unlabeled graphs (efficient)
Thank you!