University of São Paulo Institute of Mathematics and Statistics Computer Science Department...

Post on 13-Dec-2015


University of São Paulo

Institute of Mathematics and Statistics

Computer Science Department

Introduction to Pattern Recognition

A Bioinformatics Viewpoint

Roberto Marcondes Cesar Junior (IME-USP)

http://www.ime.usp.br/~cesar/

cesar@ime.usp.br

Organization

Introduction

Case Study

Generalizing the Concepts

Concluding Remarks

Introduction

Pattern Recognition: To recognize is to classify. To classify an object is to label the object. An object is anything we want to recognize.

Applications: computer vision, speech recognition, bioinformatics, ...

Case Study

We are interested in studying some disease, which we will call disease X.

Hypothesis: There are several different types of disease X, which will be called A, B, C, ...

Question: What is the expression behaviour of a given set of genes g1, g2, ..., gn with respect to A, B, C, ...?

Case Study

First step: gathering some sick people

[Figure: six cases, C1 to C6]

Case Study

Second step: Each case will be analyzed based on the gene expression with respect to g1, g2, ..., gn.

Therefore, we have to measure the expression of the genes of interest for each case C1, C2, ..., C6.

Example: microarrays

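Case Study

[Figure: the six cases, numbered 1 to 6]

Case Study

[Figure: the microarray of one case read as an array of expression values; only the corner below is legible]

... ... ... ...
... 3.0 7 10
... 1 7.0 2
... 0 9 20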

Case Study

[Figure: the six cases, each paired with its microarray, labelled M1 to M6. Only the corners of the first two arrays are legible; every other entry is elided.]

First array shown:
... ... ... ...
... 3.0 7 10
... 1 7.0 2
... 0 9 20

Second array shown:
... ... ... ...
... 3 1 21
... 4.0 1 20
... 1 5 30

Case Study

Expression vector: stacking the array lines

Array:
... ... ... ...
... 3.0 7 10
... 1 7.0 2
... 0 9 20

Expression vector: (..., 1, 7.0, 2, ..., 0, 9, 20)
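The stacking step can be sketched in a few lines of Python. This is only an illustration: the 3x3 block holds the values legible in the array on the slide, the elided entries are simply left out, and the name `stack_rows` is made up for this example.

```python
# Legible 3x3 corner of one microarray (elided entries are omitted).
array = [
    [3.0, 7.0, 10.0],
    [1.0, 7.0, 2.0],
    [0.0, 9.0, 20.0],
]

def stack_rows(matrix):
    """Concatenate the rows of a matrix into one flat expression vector."""
    return [value for row in matrix for value in row]

expression_vector = stack_rows(array)
print(expression_vector)
```

Each position of the resulting list always refers to the same spot of the array, which is what allows vectors from different cases to be compared coordinate by coordinate.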

Case Study

[Figure: the six expression vectors, one per case. Only the first two are partly legible; every other entry is elided.]

First vector shown: (..., 1, 7.0, 2, ..., 0, 9, 20)
Second vector shown: (..., 4.0, 1, 20, ..., 1, 5, 30)

Case Study

In brief: Each case C1, ..., C6 is represented by a vector v1, v2, ..., v6.

Each coordinate of an expression vector corresponds to the expression of a given gene gi.

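Case Study

Some PR terminology:

Feature: one coordinate of the vector, i.e. the expression of one gene
Feature vector: the whole vector, e.g. (..., 1, 7.0, 2, ..., 0, 9, 20)

Case Study

Training set: the collection of all feature vectors
Sample: one feature vector in the training set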

Case Study

Let's simplify things: we're only interested in two genes, g1 and g2.

Feature vectors (g1, g2):

v1 = (15, 13), v2 = (12, 16), v3 = (17, 15), v4 = (1, 4), v5 = (5, 2), v6 = (4, 1)

[Figure: plot with both axes running from 0 to 20]
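To get a feeling for how the six cases relate to each other, one can compute the distance between every pair of feature vectors. The sketch below assumes the Euclidean distance, which the slides do not name; the names `features`, `euclidean` and `distances` are chosen for this example only.

```python
import math
from itertools import combinations

# Feature vectors (g1, g2) as listed on the slide.
features = {
    "v1": (15, 13), "v2": (12, 16), "v3": (17, 15),
    "v4": (1, 4), "v5": (5, 2), "v6": (4, 1),
}

def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Distance for every unordered pair of cases.
distances = {
    (p, q): euclidean(features[p], features[q])
    for p, q in combinations(sorted(features), 2)
}

closest_pair = min(distances, key=distances.get)
print(closest_pair, round(distances[closest_pair], 2))
```

With these six vectors the closest pair is v5 and v6, and all distances inside {v1, v2, v3} and inside {v4, v5, v6} are much smaller than any distance between the two groups.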

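Case Study

[Figure: the six feature vectors v1, ..., v6 plotted with g1 and g2 as axes and grouped into Type A and Type B]

Feature space: the plane with axes g1 and g2
Classes: Type A and Type B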

Case Study: the classifier

Input: training set with unlabelled samples
[Figure: the samples plotted in the feature space, both axes from 0 to 20]

Output: classes of the feature space
[Figure: the same plot with the feature space divided into classes]

Unsupervised classifier: clustering algorithm

Case Study: Linkage Algorithm

[Figure: the six feature vectors in the feature space, both axes from 0 to 20]

Case Study: Linkage Algorithm

[Figure: the same feature space next to the resulting dendrogram; leaf order v2, v1, v3, v4, v6, v5]
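A linkage algorithm builds the dendrogram bottom-up: start with one cluster per sample and repeatedly merge the two closest clusters. The slides do not say which linkage rule or distance is used, so the sketch below assumes single linkage (cluster distance = smallest pairwise distance) with the Euclidean distance; `single_linkage` is a name made up for this example.

```python
import math

# Feature vectors (g1, g2) from the two-gene example.
points = {
    "v1": (15, 13), "v2": (12, 16), "v3": (17, 15),
    "v4": (1, 4), "v5": (5, 2), "v6": (4, 1),
}

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def single_linkage(points):
    """Return the merges (left cluster, right cluster, distance) in order."""
    clusters = [frozenset([name]) for name in sorted(points)]
    merges = []
    while len(clusters) > 1:
        best = None  # (distance, index i, index j) of the closest pair
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: smallest distance between any two members.
                d = min(euclidean(points[p], points[q])
                        for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((sorted(clusters[i]), sorted(clusters[j]), d))
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

merges = single_linkage(points)
for left, right, d in merges:
    print(left, "+", right, "at distance", round(d, 2))
```

Each merge corresponds to one junction of the dendrogram, and the merge distance is its height. Under these assumptions the first merge joins v5 and v6 and the last one joins {v1, v2, v3} with {v4, v5, v6}. Two intermediate merges happen at exactly the same distance, so their order depends on how ties are broken.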

Case Study: Visualization

Intermezzo: vectors as signals

[Figure: the expression vector (..., 0, 9, 20) plotted as a signal; horizontal axis from 0 to 18, vertical axis from 0 to 20]

Case Study: Visualization

Intermezzo: signals as images

[Figure: the signal rendered as an image; horizontal axis from 0 to 18, vertical axis from 0 to 20]
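One common way to show a signal as an image is to map each value to a gray level, so that a vector becomes a row of pixels. The slides do not describe the mapping they use; the sketch below assumes a linear rescaling to the range 0 to 255, and `to_gray_levels` is a hypothetical helper.

```python
def to_gray_levels(signal, levels=256):
    """Linearly rescale the values of a signal to integers in [0, levels - 1]."""
    lo, hi = min(signal), max(signal)
    if hi == lo:
        # A constant signal carries no contrast; map everything to 0.
        return [0 for _ in signal]
    return [round((x - lo) / (hi - lo) * (levels - 1)) for x in signal]

# The three legible values at the end of the expression vector.
signal = [0, 9, 20]
pixels = to_gray_levels(signal)
print(pixels)
```

Stacking one such row per case gives a picture in which similar cases show similar patterns of light and dark pixels.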

Generalizing the concepts

Putting it all together: data mining

Concluding remarks

Supervised classification

Which classifier should be used?

Be careful: clustering algorithms always find clusters!

Normalization issues
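Normalization matters because distance-based clustering is dominated by whichever gene has the largest numeric range. A minimal sketch of one common remedy, assuming z-score normalization per gene (the slides do not say which normalization is meant), applied to the two-gene feature vectors of the case study:

```python
import statistics

# Feature vectors (g1, g2) from the two-gene example, one row per case.
samples = [(15, 13), (12, 16), (17, 15), (1, 4), (5, 2), (4, 1)]

def zscore_columns(rows):
    """Rescale every feature (column) to zero mean and unit standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.mean(c) for c in columns]
    stdevs = [statistics.pstdev(c) for c in columns]
    return [
        tuple((x - m) / s if s > 0 else 0.0
              for x, m, s in zip(row, means, stdevs))
        for row in rows
    ]

normalized = zscore_columns(samples)
for row in normalized:
    print(tuple(round(x, 2) for x in row))
```

After this step every gene contributes on the same scale, so no single gene decides the clustering merely because it is measured in larger numbers.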

Concluding remarks

A key problem: which genes should be used?

Or: which features should be selected?

Well-known problem in PR: Dimensionality Reduction
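As a very simple illustration of feature selection, one can rank genes by how much their expression varies across the cases and keep only the top few. This variance ranking is just one of many possible criteria and is not taken from the slides. In the sketch, g1 and g2 are the values of the case study, while g3 is a made-up, nearly constant gene added only so that there is something to discard.

```python
import statistics

# Expression of each gene across the six cases.
expression = {
    "g1": [15, 12, 17, 1, 5, 4],
    "g2": [13, 16, 15, 4, 2, 1],
    "g3": [7, 7, 8, 7, 8, 7],  # hypothetical values, not from the slides
}

def top_variance_features(data, k):
    """Return the names of the k features with the largest variance."""
    ranked = sorted(data, key=lambda g: statistics.pvariance(data[g]),
                    reverse=True)
    return ranked[:k]

selected = top_variance_features(expression, 2)
print(selected)
```

Variance ranking ignores the class labels, so a high-variance gene is not guaranteed to separate the types of the disease; it only removes genes that barely change at all.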

Concluding remarks

[Figure: axes Y1 and Y2]

Concluding remarks

[Figure: Feature space 1]

Concluding remarks

[Figure: Feature space 2]

Concluding remarks

[Figure: Feature space 3]