Semi-Supervised Feature Selection for Graph Classification

21
Semi-Supervised Feature Selection for Graph Classification Xiangnan Kong, Philip S. Yu Department of Computer Science University of Illinois at Chicago KDD 2010

description

Xiangnan Kong,Philip S. Yu. Semi-Supervised Feature Selection for Graph Classification. Department of Computer Science University of Illinois at Chicago. KDD 2010. Graph Classification - why should we care?. - PowerPoint PPT Presentation

Transcript of Semi-Supervised Feature Selection for Graph Classification

Page 1: Semi-Supervised Feature Selection  for Graph Classification

Semi-Supervised Feature Selection

for Graph ClassificationXiangnan Kong, Philip S. Yu

Department of Computer ScienceUniversity of Illinois at Chicago

KDD 2010

Page 2: Semi-Supervised Feature Selection  for Graph Classification

2

Graph Classification - why should we care?

Program Flows XML DocsChemical Compounds

In real apps, data are not directly represented as feature vectors, but graphs with complex structures.

E.g. G(V, E, l) - y

Conventional data mining and machine learning approaches assume data are represented as feature vectors. E.g. (x1, x2, …, xd) - y

Page 3: Semi-Supervised Feature Selection  for Graph Classification

3

Example: Graph Classification

+ - ?

Training Graphs

Testing Graph Drug activity

prediction problem

Given a set of chemical compounds labeled with activities

Predict the activities of testing molecules

Also Program Flows and XML docs

Program Flows XML Docs

Page 4: Semi-Supervised Feature Selection  for Graph Classification

4

Feature Vectors

HHNx1

x2

Subgraph-based Graph Classification

H

H H

OC

H

H

H

H

H

H

H

H1 0

0 1

1

1

Subgraph Patterns

H

G1

G2

g1 g2 g3H H

N

HHN

OOC C

C CO

O

C

CC

CC

CC

CC

CCC

CCC C C

CCCC

C C C

How to mine a set of subgraph patterns in order

to effectively perform graph classification?

Classifierx1 x2

Graph Objects

Feature Vectors

Classifiers

Page 5: Semi-Supervised Feature Selection  for Graph Classification

5

Conventional Methods- Two Components

Two Components:1. Evaluation

(effective)whether a subgraph feature is relevant to graph classification?

2. Search space pruning (efficient)

how to avoid enumerating all subgraph features?

Labeled Graphs

+ -

Discriminative Subgraphs

HHN C

CC

CCC

OOC C

Page 6: Semi-Supervised Feature Selection  for Graph Classification

6

One Problem

Supervised Settings Require a large set of

labeled training graphs

However…Labelin

g a graph is hard !

Labeled Graphs

+ -

Discriminative Subgraphs

HHN C

CC

CCC

Page 7: Semi-Supervised Feature Selection  for Graph Classification

7

Lack of labels -> problems

Supervised Methods:1. Evaluation effective?

require large amount of label information

2. Search space pruning efficient?

pruning performances rely on

large amount of label information

Labeled Graphs

Discriminative Subgraphs

HHN C

CC

CCC

OOC C

+-

Page 8: Semi-Supervised Feature Selection  for Graph Classification

8

Semi-Supervised Feature Selection for Graph Classification

Subgraph Patterns

F(p)

Classification

++ -HH

N C

CC

CCC

OOC C

EvaluationCriteria

Labeled Graphs

+-

Unlabeled Graphs

??

?

Mine useful subgraph patterns using labeled and unlabeled graphs

Page 9: Semi-Supervised Feature Selection  for Graph Classification

9

Two Key Questions to Address

Evaluation: How to evaluate a set of subgraph features with both labeled and unlabeled graphs? (effective)

Search Space Pruning: How to prune the subgraph search space using both labeled and unlabeled graphs?

(efficient)

Page 10: Semi-Supervised Feature Selection  for Graph Classification

10

Cannot-Link Graphs in different classes should be far away

Must-Link Graphs in the same class should be close

Separability Unlabeled graphs are able to be separated from each other

What is a good feature?

+ +

--

2

13 Labeled

Labeled

Unlabeled

Unlabeled

LabeledLabeled

Page 11: Semi-Supervised Feature Selection  for Graph Classification

11

Optimization

+ -

+ +

? ?

Cannot-Link Graphs in different classes should be far away

Must-LinkGraphs in the same class should be close

SeparabilityUnlabeled graphs are able to be separated from each other

Evaluation Function:

Page 12: Semi-Supervised Feature Selection  for Graph Classification

12

gSemi Score

Evaluation: gSemi Criterion

gSemi Score:

represents the k-th subgraph feature

In matrix form:

(the sum over all selected features)

NHHgood

CCCCbad

Page 13: Semi-Supervised Feature Selection  for Graph Classification

13

Experiment Results#labeled Graphs

=30

#labeled Graphs

=50

#labeled Graphs

=70

MCF-3 dataset

NCI-H23 dataset

OVCAR-8 dataset

PTC-MMdataset

PTC-FM dataset

Page 14: Semi-Supervised Feature Selection  for Graph Classification

14

Experiment Results

Accuracy(higher is

better)

Our approach performed best at NCI and PTC datasets

gSemi (Semi-Supervised)

Information Gain (Supervised)

Frequent (Unsupervised)

(#labeled graphs=30, MCF-3)

# Selected Features

Page 15: Semi-Supervised Feature Selection  for Graph Classification

15

Two Key Questions to Address

How to evaluate a set of subgraph features with both labeled and unlabeled graphs? (effective)

How to prune the subgraph search space using both labeled and unlabeled graphs?

(efficient)

Page 16: Semi-Supervised Feature Selection  for Graph Classification

16

Finding a Needle in a HaystackPattern Search Tree

┴0-edges

1-edge2-edges…

gSpan [Yan et. al ICDM’02]

An efficient algorithm to enumerate all frequent subgraph patterns (frequency ≥ min_support)

not frequent

Too many frequent subgraph patterns

Find the most useful one(s)

How to find the Best node(s) in this tree

without searching all the nodes? (Branch and Bound to prune the

search space)

Page 17: Semi-Supervised Feature Selection  for Graph Classification

17

Pruning Principle

NHH

best subgraphso far

CC

CC

C H

CC

CC

C

C

CC

CC

C …

CCCC

best scoreso far

upper bound

…currentnode

CC

CC

Pattern Search Tree

currentscore

sub-treeCCCC

C

HCCCC

C

CCCCC

C

gSemi score

If best score ≥ upper

boundWe can prune the entire

sub-tree

Page 18: Semi-Supervised Feature Selection  for Graph Classification

18

Pruning Results

WithoutgSemi pruning

gSemi pruning

(MCF-3 dataset)

Running time(seconds)

(lower is better)

Page 19: Semi-Supervised Feature Selection  for Graph Classification

19

Pruning Results

gSemi pruning

WithoutgSemi pruning

(MCF-3 dataset)

# Subgraphs explored

(lower is better)

Page 20: Semi-Supervised Feature Selection  for Graph Classification

20

Parameters

Accuracy(higher is

better)

(MCF-3 dataset, #label=50, #feature=20)

best (α=1, β=0.1) Semi-Supervised

close toSupervised

close toUnsupervised

α(cannot-link constraints)

(must-linkconstraints)

β

Page 21: Semi-Supervised Feature Selection  for Graph Classification

21

Conclusions

Semi-Supervised Feature Selection for Graph Classification Evaluating subgraph features using both

labeled and unlabeled graphs (effective) Branch&bound pruning the search space

using labeled and unlabeled graphs (efficient)

Thank you!