The UNIVERSITY of Kansas EECS 800 Research Seminar Mining Biological Data Instructor: Luke Huan Fall, 2006

The UNIVERSITY of Kansas

EECS 800 Research Seminar

Mining Biological Data

Instructor: Luke Huan

Fall, 2006

Mining Biological Data, KU EECS 800, Luke Huan, Fall’06

11/01/2006: Semi-supervised learning

Administrative

Student presentations start next week.

Send me the list of references that you use if your choices differ from those listed on the class webpage.

Speaker: allocate enough time for background.

20 minutes for a general audience

20 minutes for people working in “your area”

<20 minutes for experts

Keep in mind the five “W” questions: What is it? Why is it important? Who cares? Does it work or not? So what?

Participants: need to ask at least one question.

Overview

Semi-supervised learning:

Semi-supervised classification

Semi-supervised clustering

Semi-supervised clustering:

Search-based methods: COP K-Means, Seeded K-Means, Constrained K-Means

Similarity-based methods

Supervised Classification Example

[Figure sequence, slides 5-7; images not recoverable from the transcript]

Unsupervised Clustering Example

[Figure sequence, slides 8-9; images not recoverable from the transcript]

Semi-Supervised Learning

Combines labeled and unlabeled data during training to improve performance:

Semi-supervised classification: Training on labeled data exploits additional unlabeled data, frequently resulting in a more accurate classifier.

Semi-supervised clustering: Uses a small amount of labeled data to aid and bias the clustering of unlabeled data.

Semi-Supervised Classification Example

[Figure sequence, slides 11-12; images not recoverable from the transcript]

Semi-Supervised Classification

Algorithms:

Semi-supervised EM [Ghahramani:NIPS94, Nigam:ML00].

Co-training [Blum:COLT98].

Transductive SVMs [Vapnik:98, Joachims:ICML99].

Assumptions:

Known, fixed set of categories given in the labeled data.

Goal is to improve classification of examples into these known categories.

Semi-Supervised Clustering Example

[Figure sequence, slides 14-15; images not recoverable from the transcript]

Second Semi-Supervised Clustering Example

[Figure sequence, slides 16-17; images not recoverable from the transcript]

Semi-Supervised Clustering

Can group data using the categories in the initial labeled data.

Can also extend and modify the existing set of categories as needed to reflect other regularities in the data.

Can cluster a disjoint set of unlabeled data using the labeled data as a “guide” to the type of clusters desired.

Problem definition

Input:

A set of unlabeled objects

Some domain knowledge

Output:

A partitioning of the objects into clusters

Objective:

Maximum intra-cluster similarity

Minimum inter-cluster similarity

High consistency between the partitioning and the domain knowledge

What is Domain Knowledge?

Must-link and cannot-link

Class labels

Ontology
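The first two kinds of domain knowledge can be written down as pairwise constraints. The sketch below is illustrative only: the function names and the representation (object ids as integers, constraints as pairs) are assumptions, not something the slides specify. It derives must-link and cannot-link pairs from a few class labels and counts how many constraints a given partitioning breaks, which is one simple way to quantify "consistency between the partitioning and the domain knowledge".

```python
from typing import Dict, List, Tuple

Pair = Tuple[int, int]


def labels_to_constraints(labels: Dict[int, str]) -> Tuple[List[Pair], List[Pair]]:
    """Derive pairwise constraints from class labels known for a subset of objects."""
    ids = sorted(labels)
    must_link, cannot_link = [], []
    for pos, a in enumerate(ids):
        for b in ids[pos + 1:]:
            # Same label: the pair must share a cluster; different label: it must not.
            (must_link if labels[a] == labels[b] else cannot_link).append((a, b))
    return must_link, cannot_link


def count_violations(assignment: Dict[int, int],
                     must_link: List[Pair],
                     cannot_link: List[Pair]) -> int:
    """Count the constraints broken by a cluster assignment (object id -> cluster id)."""
    violations = 0
    for a, b in must_link:
        if assignment[a] != assignment[b]:
            violations += 1
    for a, b in cannot_link:
        if assignment[a] == assignment[b]:
            violations += 1
    return violations
```

Class labels are the stronger form of knowledge here: every pair of labeled objects yields a constraint, whereas a set of constraints does not in general determine labels.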

Why semi-supervised clustering?

Why not clustering? It cannot incorporate prior knowledge into the clustering process.

Why not classification? Sometimes there are insufficient labeled data.

Potential applications:

Bioinformatics (gene and protein clustering)

Document hierarchy construction

News/email categorization

Image categorization

Semi-Supervised Clustering

Approaches:

Search-based semi-supervised clustering: alter the clustering algorithm using the constraints

Similarity-based semi-supervised clustering: alter the similarity measure based on the constraints

Combination of both

Search-Based Semi-Supervised Clustering

Alter the clustering algorithm that searches for a good partitioning by:

Modifying the objective function to give a reward for obeying labels on the supervised data [Demiriz:ANNIE99].

Enforcing constraints (must-link, cannot-link) on the labeled data during clustering [Wagstaff:ICML00, Wagstaff:ICML01].

Using the labeled data to initialize clusters in an iterative refinement algorithm (KMeans, EM) [Basu:ICML02].

Unsupervised KMeans Clustering

KMeans iteratively partitions a dataset into K clusters $\{X_l\}_{l=1}^{K}$.

Algorithm:

Initialize K cluster centers randomly.

Repeat until convergence:

Cluster Assignment Step: Assign each data point x to the cluster $X_l$ such that the L2 distance of x from $\mu_l$ (the center of $X_l$) is minimum.

Center Re-estimation Step: Re-estimate each cluster center $\mu_l$ as the mean of the points in that cluster.
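The two alternating steps can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function name, the optional `init_centers` argument, and the choice to draw random initial centers from the data points are all assumptions made for the example.

```python
import numpy as np


def kmeans(X, k, init_centers=None, n_iter=100, seed=0):
    """Plain K-Means: alternate assignment and center re-estimation."""
    X = np.asarray(X, dtype=float)
    if init_centers is None:
        # Assumption: random initialization picks k distinct data points.
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)]
    else:
        centers = np.array(init_centers, dtype=float)

    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        # Cluster assignment step: nearest center in L2 distance.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments stopped changing: converged
        labels = new_labels
        # Center re-estimation step: mean of the points in each cluster.
        for l in range(k):
            members = X[labels == l]
            if len(members) > 0:  # keep the old center if a cluster empties
                centers[l] = members.mean(axis=0)
    return labels, centers
```

Passing `init_centers` makes a run reproducible, which is also the hook that seeding-based variants use.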

KMeans Objective Function

Locally minimizes the sum of squared distances between the data points and their corresponding cluster centers:

$\sum_{l=1}^{K} \sum_{x_i \in X_l} \| x_i - \mu_l \|^2$

Initialization of K cluster centers:

Totally random

Random perturbation from global mean

Heuristic to ensure well-separated centers, etc.
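A short worked step, added as a sketch, shows why re-estimating each center as the cluster mean never increases this objective. Writing the sum of squared distances as $J$ (a symbol introduced here) and holding the assignments fixed, setting the gradient with respect to one center to zero gives the mean:

```latex
J = \sum_{l=1}^{K} \sum_{x_i \in X_l} \| x_i - \mu_l \|^2
\qquad
\frac{\partial J}{\partial \mu_l} = -2 \sum_{x_i \in X_l} (x_i - \mu_l) = 0
\;\Longrightarrow\;
\mu_l = \frac{1}{|X_l|} \sum_{x_i \in X_l} x_i
```

The assignment step cannot increase $J$ either, since each point moves to the center that minimizes its own term. Both steps are therefore non-increasing in $J$, which is why KMeans converges, though only to a local minimum that depends on the initialization.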

K Means Example

[Figure sequence, slides 26-33; images not recoverable from the transcript. Panel titles follow.]

Randomly Initialize Means

Assign Points to Clusters

Re-estimate Means

Re-assign Points to Clusters

Re-estimate Means

Re-assign Points to Clusters

Re-estimate Means and Converge

Semi-Supervised K-Means

Constraints (must-link, cannot-link):

COP K-Means

Partial label information is given:

Seeded K-Means (Basu, ICML’02)

Constrained K-Means

COP K-Means

COP K-Means is K-Means with must-link (must be in same cluster) and cannot-link (cannot be in same cluster) constraints on data points.

Initialization: Cluster centers are chosen randomly, subject to no must-link constraint being violated.

Algorithm: During the cluster assignment step in COP K-Means, a point is assigned to its nearest cluster that does not violate any of its constraints. If no such assignment exists, abort.

Based on Wagstaff et al.: ICML01
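The constrained assignment step might be sketched as follows. This is an illustration under assumptions, not the authors' code: the names `violates` and `cop_assign` are hypothetical, points are processed in index order, and a constraint is checked only against points already assigned in the current pass.

```python
import numpy as np


def violates(i, cluster, labels, must_link, cannot_link):
    """True if placing point i in `cluster` breaks a constraint with an assigned point."""
    for a, b in must_link:
        if i in (a, b):
            other = b if a == i else a
            # Must-link partner already sits in a different cluster.
            if labels[other] != -1 and labels[other] != cluster:
                return True
    for a, b in cannot_link:
        if i in (a, b):
            other = b if a == i else a
            # Cannot-link partner already sits in this cluster.
            if labels[other] == cluster:
                return True
    return False


def cop_assign(X, centers, must_link, cannot_link):
    """One COP K-Means assignment pass; returns None if some point has no legal cluster."""
    labels = np.full(len(X), -1)  # -1 marks "not yet assigned"
    for i, x in enumerate(X):
        # Try clusters from nearest to farthest center.
        for c in np.argsort(np.linalg.norm(centers - x, axis=1)):
            if not violates(i, c, labels, must_link, cannot_link):
                labels[i] = c
                break
        else:
            return None  # every cluster violates a constraint: abort
    return labels
```

Because points are assigned greedily, the result can depend on the processing order, and a pass can abort even when a constraint-satisfying partitioning exists.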

COP K-Means Algorithm

[Algorithm listing not recoverable from the transcript]

Illustration

[Figure sequence, slides 37-39; images not recoverable from the transcript. Annotations follow.]

Must-link: determine its label; assign to the red class.

Cannot-link: determine its label; assign to the red class.

Cannot-link and must-link: determine its label; the clustering algorithm fails.

Evaluation

Rand index: measures the agreement between two partitions, P1 and P2, of the same data set D.

Each partition is viewed as a collection of n(n-1)/2 pairwise decisions, where n is the size of D.

a is the number of decisions where both P1 and P2 put a pair of objects into the same cluster.

b is the number of decisions where a pair of objects is placed in different clusters in both partitions.

Total agreement can then be calculated using Rand(P1, P2) = (a + b) / (n(n-1)/2).
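The definition translates directly into code. The sketch below assumes each partition is given as a sequence of cluster ids, one per object and in the same object order; the function name is illustrative.

```python
from itertools import combinations


def rand_index(p1, p2):
    """Rand index between two partitions given as per-object cluster ids."""
    n = len(p1)
    a = b = 0
    for i, j in combinations(range(n), 2):
        same1 = p1[i] == p1[j]
        same2 = p2[i] == p2[j]
        if same1 and same2:
            a += 1  # pair is together in both partitions
        elif not same1 and not same2:
            b += 1  # pair is apart in both partitions
    return (a + b) / (n * (n - 1) / 2)
```

Only pairwise co-membership matters, so the score does not change when cluster ids are renamed: `[0, 0, 1, 1]` and `[1, 1, 0, 0]` agree perfectly.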

Semi-Supervised K-Means

Seeded K-Means:

Labeled data provided by the user are used for initialization: the initial center for cluster i is the mean of the seed points having label i.

Seed points are only used for initialization, and not in subsequent steps.

Constrained K-Means:

Labeled data provided by the user are used to initialize the K-Means algorithm.

Cluster labels of seed data are kept unchanged in the cluster assignment steps, and only the labels of the non-seed data are re-estimated.

Based on Basu et al., ICML’02.
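The two variants differ in a single line, which the sketch below makes explicit. It is a minimal illustration under assumptions: the function name and the `clamp_seeds` flag are hypothetical, seed labels are integers 0..k-1 with -1 for unlabeled points, and every label is assumed to have at least one seed.

```python
import numpy as np


def seeded_kmeans(X, seed_labels, k, clamp_seeds=False, n_iter=100):
    """clamp_seeds=False: Seeded K-Means; clamp_seeds=True: Constrained K-Means."""
    X = np.asarray(X, dtype=float)
    seed_labels = np.asarray(seed_labels)
    is_seed = seed_labels >= 0

    # Initial center of cluster l is the mean of the seed points labeled l.
    centers = np.array([X[seed_labels == l].mean(axis=0) for l in range(k)])

    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if clamp_seeds:
            # Constrained K-Means: seed points keep their given labels.
            new_labels[is_seed] = seed_labels[is_seed]
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        for l in range(k):
            members = X[labels == l]
            if len(members) > 0:
                centers[l] = members.mean(axis=0)
    return labels, centers
```

With a noisy seed (a point labeled 1 that lies among the points of cluster 0), the seeded variant lets the point drift to cluster 0, while the constrained variant keeps it in cluster 1 and pulls that center toward it.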

Seeded K-Means

Use labeled data to find the initial centroids and then run K-Means.

The labels for seeded points may change.

Seeded K-Means Example

[Figure sequence, slides 44-48; images not recoverable from the transcript. Panel titles follow.]

Initialize Means Using Labeled Data

Assign Points to Clusters

Re-estimate Means

Assign Points to Clusters and Converge (the label is changed)

Constrained K-Means

Use labeled data to find the initial centroids and then run K-Means.

The labels for seeded points will not change.

Constrained K-Means Example

[Figure sequence, slides 50-53; images not recoverable from the transcript. Panel titles follow.]

Initialize Means Using Labeled Data

Assign Points to Clusters

Re-estimate Means and Converge

Datasets

Data sets:

UCI Iris (3 classes; 150 instances)

CMU 20 Newsgroups (20 classes; 20,000 instances)

Yahoo! News (20 classes; 2,340 instances)

Data subsets created for experiments:

Small-20 newsgroup: random sample of 100 documents from each newsgroup, created to study the effect of data size on algorithms.

Different-3 newsgroup: 3 very different newsgroups (alt.atheism, rec.sport.baseball, sci.space), created to study the effect of data separability on algorithms.

Same-3 newsgroup: 3 very similar newsgroups (comp.graphics, comp.os.ms-windows, comp.windows.x).

Evaluation

Mutual information

Objective function
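The slide names mutual information without giving a formula. As a hedged sketch, the plain (unnormalized) mutual information between cluster assignments and true class labels can be computed as below, in nats. The experiments may have used a normalized variant, so this is an assumption rather than the measure actually reported; the function name is illustrative.

```python
import math
from collections import Counter


def mutual_information(clusters, classes):
    """Mutual information (in nats) between cluster ids and class labels."""
    n = len(clusters)
    joint = Counter(zip(clusters, classes))  # co-occurrence counts
    cluster_counts = Counter(clusters)
    class_counts = Counter(classes)
    mi = 0.0
    for (c, k), n_ck in joint.items():
        # p(c,k) * log( p(c,k) / (p(c) * p(k)) ), written with raw counts.
        mi += (n_ck / n) * math.log(n_ck * n / (cluster_counts[c] * class_counts[k]))
    return mi
```

A clustering that reproduces two equal-sized classes exactly scores log 2, and one that is independent of the classes scores 0.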

Results: MI and Seeding

Zero noise in seeds [Small-20 NewsGroup]

Semi-Supervised KMeans substantially better than unsupervised KMeans

Results: Objective Function and Seeding

User-labeling consistent with KMeans assumptions [Small-20 NewsGroup]

Objective function of data partition increases exponentially with seed fraction

Results: Objective Function and Seeding

User-labeling inconsistent with KMeans assumptions [Yahoo! News]

Objective function of constrained algorithms decreases with seeding

Similarity Based Methods

Question: given a set of points and their class labels, can we learn a distance metric such that intra-cluster distances are minimized and inter-cluster distances are maximized?

Distance metric learning

Define a new distance measure of the form:

Linear transformation of the original data
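The formula on this slide did not survive extraction. Assuming it follows Xing et al., whose work the deck cites as the source for this part, the learned distance would take the form d_A(x, y) = sqrt((x - y)^T A (x - y)) with A symmetric positive semidefinite. The sketch below is written under that assumption, with illustrative function names, and shows why this amounts to a linear transformation of the original data: factoring A = L^T L makes d_A the ordinary Euclidean distance between Lx and Ly.

```python
import numpy as np


def metric_distance(x, y, A):
    """Distance sqrt((x - y)^T A (x - y)) for a symmetric PSD matrix A."""
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.sqrt(d @ A @ d))


def as_linear_map(A):
    """Return L with L.T @ L == A, via the eigendecomposition of symmetric PSD A."""
    w, V = np.linalg.eigh(A)
    w = np.clip(w, 0.0, None)  # guard against tiny negative eigenvalues
    return (V * np.sqrt(w)).T  # L = diag(sqrt(w)) @ V.T
```

With A = diag(4, 1), a unit step along the first axis counts as distance 2 and a unit step along the second as distance 1. Learning A thus means learning how much each direction should matter before an ordinary clustering algorithm is run.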

Semi-Supervised Clustering Example

[Figure sequence, slides 61-64; images not recoverable from the transcript. Panel titles follow.]

Similarity Based

Distances Transformed by Learned Metric

Clustering Result with Trained Metric

Evaluation

[Figures, slides 65-66; images not recoverable from the transcript]

Source: E. Xing, et al. Distance metric learning

Additional Readings

Combining similarity and search-based semi-supervised clustering: “Comparing and Unifying Search-Based and Similarity-Based Approaches to Semi-Supervised Clustering”, Basu, et al.

Ontology-based semi-supervised clustering: “A framework for ontology-driven subspace clustering”, Liu et al.

Summary

Semi-supervised clustering

Its applications in Bioinformatics? Not fully explored

Reference

UT machine learning group: http://www.cs.utexas.edu/~ml/publication/unsupervised.html

Semi-supervised Clustering by Seeding: http://www.cs.utexas.edu/users/ml/papers/semi-icml-02.pdf

Constrained K-means clustering with background knowledge: http://www.litech.org/~wkiri/Papers/wagstaff-kmeans-01.pdf

Some slides are from Jieping Ye at Arizona State