Clustering and machine learning for gene expression data Stefan Enroth Original slides by Torgeir R. Hvidsten The Linnaeus Centre for Bioinformatics

Page 1: Clustering and machine learning for gene expression dataxray.bmc.uu.se/kurs/BioinfX3/2009/l13_clustering.pdf · Data analysis methods • Unsupervised learning (clustering, class

Clustering and machine learning for gene expression data

Stefan Enroth

Original slides by Torgeir R. Hvidsten

The Linnaeus Centre for Bioinformatics

S. Enroth, 2009.02.18

Machine learning: to learn general concepts from examples

[Diagram labels: Real world; Data (feature space); Knowledge (classes); Data collection; Abstraction; Machine learning]

Assumed functional relationship partially described by the examples

Gene Ontology

Ordered controlled vocabulary organized in a taxonomy for describing the molecular role of gene products

• Molecular function: the tasks performed by individual gene products
• Biological process: broad biological goals that are accomplished by ordered assemblies of molecular functions
• Cellular component: subcellular structures, locations, and macromolecular complexes

Protein structure classification (CATH)

Microarray

Numerical data

Gene/Expr     E1     E2     E3     E4     E5     E6     E7     E8     E9    E10      …     EM
G1         -0.47  -3.32  -0.81   0.11  -0.60  -1.36  -1.03  -1.84  -1.00  -0.60      …  -0.94
G2          0.66   0.07   0.20   0.29  -0.89  -0.45  -0.29  -0.29  -0.15  -0.45      …  -0.42
G3          0.14  -0.04   0.00  -0.15  -0.58  -0.30  -0.18  -0.38  -0.49  -0.81      …  -1.12
G4         -0.04   0.00  -0.23  -0.25  -0.47  -0.60  -0.56  -1.09  -0.71  -0.76      …  -0.62
G5          0.28   0.37   0.11  -0.17  -0.18  -0.60  -0.23  -0.58  -0.79  -0.29      …  -0.74
G6          0.54   0.53   0.16   0.14   0.20  -0.34  -0.38  -0.36  -0.49  -0.58      …  -1.47
G7          0.20   0.14   0.00   0.11  -0.34  -0.03   0.04  -0.76  -0.81  -1.12      …  -1.36
G8          0.40   0.43   0.18   0.00  -0.14   0.29   0.07  -0.79  -0.81  -0.92      …  -1.22
G9          0.01   0.46   0.28  -0.34  -0.23  -0.36  -0.45  -0.64  -0.79  -1.22      …  -1.09
…              …      …      …      …      …      …      …      …      …      …      …      …
GN         -0.23   0.04   0.00  -0.30  -0.29  -0.45  -0.97  -2.06  -0.89  -1.22      …  -0.97

-0.04 = log(2.3/2.4) = log(“Red/Green”)

M < 100 (experiments, the columns)

N ≈ 10k-100k (genes, the rows)
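As a small illustration of the log-ratio entry on this slide, the sketch below recomputes the value for a red intensity of 2.3 and a green intensity of 2.4. The slide does not state the base of the logarithm; the natural logarithm is an assumption here, chosen because it reproduces the -0.04 shown. The function name `log_ratio` is hypothetical.

```python
import math

def log_ratio(red, green, base=math.e):
    """Log of the red/green intensity ratio for one spot.

    The base is an assumption: the slide only writes "log".
    """
    return math.log(red / green, base)

value = log_ratio(2.3, 2.4)
print(round(value, 2))
```

A negative value means the green channel is the stronger one; equal intensities give 0.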

Next Generation RNA-Sequencing

Nature Reviews Genetics 10, 57-63 (January 2009)

Numerical data

• Ideally, counts of the actual number of transcripts in the cell

• Also, information on isoforms, splice variants etc.

• Ongoing research!

Wang & Sandberg et al, Nature 456, 470-476 (27 November 2008)

Data analysis goals

What to study?

• Classes of experiments: changes in expression levels between tissue samples that differ in e.g. disease, treatment or environmental effects
• Classes of genes: expression profiles of genes with similar biological function
• Both of the above

Data analysis methods

• Unsupervised learning (clustering, class discovery): used to “discover” natural groups of genes/experiments, e.g.
– discover subclasses of a form of cancer that is clinically homogeneous
• Supervised learning: used to “learn” a model of a set of predefined classes of genes/experiments, e.g.
– diagnosis of cancer/subclasses of cancer

Clustering analysis

Need to define:
• measure of similarity
• algorithm for using the measure of similarity to discover natural groups in the data

The number of ways to divide n items into k clusters is approximately $k^n / k!$

Example: $10^{500} / 10! = 2.756 \times 10^{493}$
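The size of this search space can be checked with exact integer arithmetic. The sketch below evaluates the slide's expression for n = 500 and k = 10. For comparison it also computes the exact number of ways to split n items into k non-empty clusters (the Stirling number of the second kind, for which the slide's expression is the leading term). The function names are hypothetical.

```python
from math import comb, factorial

def approx_partitions(n, k):
    """The slide's estimate k**n / k!, floored to an integer."""
    return k**n // factorial(k)

def exact_partitions(n, k):
    """Exact count of partitions into k non-empty clusters (inclusion-exclusion)."""
    total = sum((-1) ** j * comb(k, j) * (k - j) ** n for j in range(k + 1))
    return total // factorial(k)

estimate = approx_partitions(500, 10)
digits = str(estimate)
# The number is far too large for a float, so format it from its digits.
print(f"{digits[0]}.{digits[1:4]}... x 10^{len(digits) - 1}")

# On a tiny problem the estimate and the exact count are close but not equal.
print(approx_partitions(4, 2), exact_partitions(4, 2))
```

Either way the conclusion is the same: trying every clustering is infeasible, which is why the heuristic algorithms on the following slides are used.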

Measure of similarity

[Figure: plot with axes E1 and E2, with a distance d marked]

What is similar?

• Euclidean distance
• Application dependent
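A minimal sketch of the Euclidean distance between two expression profiles is given below. The two example profiles reuse the first three values of rows G2 and G3 from the numerical data table; the function name `euclidean` is hypothetical.

```python
import math

def euclidean(profile_a, profile_b):
    """Euclidean distance between two equally long expression profiles."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must have the same number of experiments")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(profile_a, profile_b)))

g2 = [0.66, 0.07, 0.20]   # G2 in experiments E1-E3
g3 = [0.14, -0.04, 0.00]  # G3 in experiments E1-E3
print(euclidean(g2, g3))
```

Other measures can be swapped in behind the same two-argument interface when the application calls for a different notion of similarity.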

Hierarchical clustering

INPUT: n genes/experiments

• Consider each gene/experiment as an individual cluster and initiate an n × n distance matrix d
• Repeat
– identify the two most similar clusters in d (i.e. smallest number in d)
– merge the two most similar clusters and update the matrix (i.e. substitute the two clusters with the new cluster)

OUTPUT: A tree of merged genes/experiments (called a dendrogram)
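The loop above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the lecturer's implementation: the distance matrix is stored as a dictionary keyed by pairs of clusters, and the merged cluster's distance to any other cluster is taken as the `min` of the two old distances (passing `max` instead is another possible update rule). The function name `agglomerate` and the toy distances are hypothetical.

```python
def agglomerate(labels, dist, update=min):
    """Repeatedly merge the two closest clusters until one remains.

    labels : list of item names
    dist   : dict mapping frozenset({a, b}) -> distance between items a and b
    update : combines the two old distances into the merged cluster's
             distance to another cluster (min is an assumption)
    Returns the merges as (cluster_a, cluster_b, distance) tuples.
    """
    clusters = [(name,) for name in labels]
    # The "n x n matrix": one entry per unordered pair of clusters.
    d = {frozenset({(a,), (b,)}): dist[frozenset({a, b})]
         for i, a in enumerate(labels) for b in labels[i + 1:]}
    merges = []
    while len(clusters) > 1:
        pair = min(d, key=d.get)              # smallest number in d
        a, b = sorted(pair)
        merges.append((a, b, d[pair]))
        merged = a + b
        clusters = [c for c in clusters if c not in (a, b)]
        for c in clusters:                    # substitute a and b with merged
            d[frozenset({merged, c})] = update(d.pop(frozenset({a, c})),
                                               d.pop(frozenset({b, c})))
        del d[pair]
        clusters.append(merged)
    return merges

toy = {frozenset(p): v for p, v in {
    ("a", "b"): 1, ("c", "d"): 2, ("a", "c"): 5,
    ("a", "d"): 6, ("b", "c"): 4, ("b", "d"): 7}.items()}
merges = agglomerate(["a", "b", "c", "d"], toy)
for left, right, height in merges:
    print(left, right, height)
```

The returned list of merges, read in order, is the dendrogram: each entry joins two subtrees at the recorded height.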

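Hierarchical clustering (cont’d)

Popular inter-cluster similarity measures:

(a) single linkage (smallest distance between members of the two clusters)
(b) complete linkage (largest distance between members of the two clusters)
(c) average linkage (average distance between members of the two clusters)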
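The three linkage measures differ only in how they summarize the pairwise distances between the members of two clusters. A minimal sketch, with hypothetical function names and a one-dimensional toy distance chosen purely for illustration:

```python
from itertools import product

def pairwise(cluster_a, cluster_b, dist):
    """All distances between a member of cluster_a and a member of cluster_b."""
    return [dist(a, b) for a, b in product(cluster_a, cluster_b)]

def single_linkage(cluster_a, cluster_b, dist):
    return min(pairwise(cluster_a, cluster_b, dist))

def complete_linkage(cluster_a, cluster_b, dist):
    return max(pairwise(cluster_a, cluster_b, dist))

def average_linkage(cluster_a, cluster_b, dist):
    values = pairwise(cluster_a, cluster_b, dist)
    return sum(values) / len(values)

def abs_diff(a, b):
    """Toy distance between two numbers."""
    return abs(a - b)

left, right = [0.0, 1.0], [3.0, 7.0]
print(single_linkage(left, right, abs_diff))
print(complete_linkage(left, right, abs_diff))
print(average_linkage(left, right, abs_diff))
```

Because the three summaries can rank the same pair of clusters differently, the choice of linkage can change which clusters are merged first.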

Hierarchical clustering (cont’d)

[Figure: two dendrograms, one built with single linkage and one with average linkage]

Exactly the same data!

Example of hierarchical clustering: languages of Europe

Distance: frequency of numbers with different first letter, e.g. $d_{E,N} = 2$, $d_{E,Du} = 7$, $d_{Sp,I} = 1$

Intercluster strategy: SINGLE LINKAGE

Iteration 1

Clusters (row and column order): E, N, Da, Du, G, Fr, Sp, I, P, H, Fi

E     0
N     2  0
Da    2  1  0
Du    7  5  6  0
G     6  4  5  5  0
Fr    6  6  6  9  7  0
Sp    6  6  5  9  7  2  0
I     6  6  5  9  7  1  1  0
P     7  7  6 10  8  5  3  4  0
H     9  8  8  8  9 10 10 10 10  0
Fi    9  9  9  9  9  9  9  9  9  8  0

[Dendrogram (distance axis 1-8); labels shown so far: I, Fr]

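Iteration 2

Clusters (row and column order): (I Fr), E, N, Da, Du, G, Sp, P, H, Fi

(I Fr)    0
E         6  0
N         6  2  0
Da        5  2  1  0
Du        9  7  5  6  0
G         7  6  4  5  5  0
Sp        1  6  6  5  9  7  0
P         4  7  7  6 10  8  3  0
H        10  9  8  8  8  9 10 10  0
Fi        9  9  9  9  9  9  9  9  8  0

[Dendrogram (distance axis 1-8); labels shown so far: I, Fr, Da, N]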
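The step from the Iteration 1 matrix to the Iteration 2 matrix can be reproduced directly. With single linkage, the distance from the merged cluster (I Fr) to any other language is the smaller of the two old distances. The values below are copied from the I and Fr rows and columns of the Iteration 1 matrix; the variable names are hypothetical.

```python
# Distances from I and from Fr to every other language (Iteration 1 matrix).
d_I = {"E": 6, "N": 6, "Da": 5, "Du": 9, "G": 7, "Sp": 1, "P": 4, "H": 10, "Fi": 9}
d_Fr = {"E": 6, "N": 6, "Da": 6, "Du": 9, "G": 7, "Sp": 2, "P": 5, "H": 10, "Fi": 9}

# Single linkage: the merged cluster inherits the smaller distance.
d_I_Fr = {lang: min(d_I[lang], d_Fr[lang]) for lang in d_I}

for lang, value in d_I_Fr.items():
    print(lang, value)
```

The result matches the (I Fr) column of the Iteration 2 matrix, including the distance of 1 to Sp.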

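Iteration 3

Clusters (row and column order): (Da N), (I Fr), E, Du, G, Sp, P, H, Fi

(Da N)    0
(I Fr)    5  0
E         2  6  0
Du        5  9  7  0
G         4  7  6  5  0
Sp        5  1  6  9  7  0
P         6  4  7 10  8  3  0
H         8 10  9  8  9 10 10  0
Fi        9  9  9  9  9  9  9  8  0

[Dendrogram (distance axis 1-8); labels shown so far: I, Fr, Da, N, Sp]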

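Iteration 4

Clusters (row and column order): (Sp I Fr), (Da N), E, Du, G, P, H, Fi

(Sp I Fr)    0
(Da N)       5  0
E            6  2  0
Du           9  5  7  0
G            7  4  6  5  0
P            3  6  7 10  8  0
H           10  8  9  8  9 10  0
Fi           9  9  9  9  9  9  8  0

[Dendrogram (distance axis 1-8); labels shown so far: I, Fr, Da, N, Sp, E]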

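Iteration 5

Clusters (row and column order): (E Da N), (Sp I Fr), Du, G, P, H, Fi

(E Da N)     0
(Sp I Fr)    5  0
Du           5  9  0
G            4  7  5  0
P            6  3 10  8  0
H            8 10  8  9 10  0
Fi           9  9  9  9  9  8  0

[Dendrogram (distance axis 1-8); labels shown so far: I, Fr, Da, N, Sp, E, P]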

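Iteration 6

Clusters (row and column order): (P Sp I Fr), (E Da N), Du, G, H, Fi

(P Sp I Fr)    0
(E Da N)       5  0
Du             9  5  0
G              7  4  5  0
H             10  8  8  9  0
Fi             9  9  9  9  8  0

[Dendrogram (distance axis 1-8); labels shown so far: I, Fr, Da, N, Sp, E, P, G]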

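Iteration 7

Clusters (row and column order): (G E Da N), (P Sp I Fr), Du, H, Fi

(G E Da N)     0
(P Sp I Fr)    5  0
Du             5  9  0
H              8 10  8  0
Fi             9  9  9  8  0

[Dendrogram (distance axis 1-8); labels shown so far: I, Fr, Da, N, Sp, E, P, G, Du]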

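Iteration 8

Clusters (row and column order): (Du G E Da N), (P Sp I Fr), H, Fi

(Du G E Da N)    0
(P Sp I Fr)      5  0
H                8 10  0
Fi               9  9  8  0

[Dendrogram (distance axis 1-8); labels shown so far: I, Fr, Da, N, Sp, E, P, G, Du]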

Iteration 9

Clusters (row and column order): (P Sp I Fr Du G E Da N), H, Fi

(P Sp I Fr Du G E Da N)    0
H                          8  0
Fi                         9  8  0

[Dendrogram (distance axis 1-8); labels shown so far: I, Fr, Da, N, Sp, E, P, G, Du, H]

Iteration 10

Clusters (row and column order): (Fi H), (P Sp I Fr Du G E Da N)

(Fi H)                     0
(P Sp I Fr Du G E Da N)    8  0

[Dendrogram (distance axis 1-8); labels shown: I, Fr, Da, N, Sp, E, P, G, Du, H, Fi]

Any data mining result needs to be consistent BOTH with the data and current knowledge!

Evaluation of clusters

[Dendrogram (distance axis 1-8); labels: I, Fr, Da, N, Sp, E, P, G, Du, H, Fi. Legend: Roman, Slavic, Germanic, Ugro-Finnish]

Clusters may be evaluated according to how well they describe current knowledge

Hierarchical clustering: properties

• Huge memory requirements: stores the n × n matrix
• Running time: $O(n^3)$
• Deterministic: produces the same clustering each time
• Nice visualization: dendrogram
• Number of clusters can be selected using the dendrogram
• Different interpretations depending on distance method used.

K-means clustering

• Split the data into k random clusters
• Repeat:
– calculate the centroid of each cluster
– (re-)assign each gene/experiment to the closest centroid
– stop if no new assignments are made
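The steps above can be sketched in plain Python as follows. Several details are assumptions that the slide leaves open: the random split deals shuffled points out in turn so that every cluster starts non-empty, closeness is measured with squared Euclidean distance, a cluster that becomes empty is re-seeded with a random point, and `max_iter` caps the number of rounds. All function names are hypothetical.

```python
import random

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    dim = len(points[0])
    return tuple(sum(p[i] for p in points) / len(points) for i in range(dim))

def sq_dist(p, q):
    """Squared Euclidean distance (an assumed choice of closeness)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Split the data into k random clusters (dealt out so none is empty).
    order = list(range(len(points)))
    rng.shuffle(order)
    assignment = [0] * len(points)
    for rank, idx in enumerate(order):
        assignment[idx] = rank % k
    for _ in range(max_iter):
        # Calculate the centroid of each cluster.
        centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            # Assumption: an empty cluster is re-seeded with a random point.
            centroids.append(centroid(members) if members else rng.choice(points))
        # (Re-)assign each point to the closest centroid.
        new_assignment = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                          for p in points]
        if new_assignment == assignment:   # stop if no new assignments are made
            break
        assignment = new_assignment
    return assignment, centroids

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
assignment, centroids = kmeans(points, k=2)
print(assignment)
print(centroids)
```

Changing `seed` changes the initial split. On data with less clearly separated groups, that can lead to a different final clustering.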

Example of K-means: two dimensions

[Figure sequence: two-dimensional plot of the data, centroids marked with x]

• Initial clusters, K=2
• Iteration 1: calculate centroids, then (re-)assign
• Iteration 2: calculate centroids, then (re-)assign
• Iteration 3: calculate centroids, then (re-)assign. No new assignments! STOP

K-means: properties

• Low memory usage
• Running time: $O(n)$
• Improves iteratively: not trapped in previous mistakes
• Non-deterministic: will in general produce different clusters with different initializations
• Number of clusters must be decided in advance
– Algorithms that “grow” the number of clusters if inter-cluster variance is too high (Growing k-means, 2002).

Hierarchical vs. k-means

• Hierarchical clustering:
– computationally expensive -> relatively small data sets
– nice visualization, no. of clusters can be selected
– deterministic
– cannot correct early “mistakes”
• K-means:
– computationally efficient -> large data sets
– predefined no. of clusters
– non-deterministic -> should be run several times
– iterative improvement
• Hierarchical k-means: top-down hierarchical clustering using k-means iteratively with k=2 -> best of both worlds!

Supervised learning

• Uses examples of known classes to learn a model
• Examples are, for instance, expression profiles of genes with known classes (clinical state or function)
• The model can be e.g.
– hyperplanes separating classes in n dimensions (SVM)
– artificial neural networks
– decision trees (Random Forest, C4.5)
– IF-THEN rules (Rough Sets)
• Can be used for e.g.
– diagnostics
– predicting gene function for unknown genes

Support Vector Machines

[Figure labels: maximum margin separating “hyperplane”; support vectors; soft margin; decision boundary]

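Artificial Neural Networks (ANN)

[Figure: network with an input layer (x1, x2, x3, x4) and an output layer; labels A, B, C]

[Figure: a single unit with inputs x1 … xn, weights w1 … wn and output f(x)]

$$f(x) = \begin{cases} 1 & \text{if } \sum_{i=1}^{n} w_i x_i > \text{threshold} \\ 0 & \text{otherwise} \end{cases}$$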
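The single unit on this slide, with inputs x1 … xn, weights w1 … wn and output f(x), can be sketched as a thresholded weighted sum. The threshold value is not recoverable from the slide, so it is a parameter here, and its default of 0.0 is an assumption. The function name `threshold_unit` and the example weights are hypothetical.

```python
def threshold_unit(inputs, weights, threshold=0.0):
    """Return 1 if the weighted sum of the inputs exceeds the threshold, else 0.

    The threshold default is an assumption, not taken from the slide.
    """
    if len(inputs) != len(weights):
        raise ValueError("need exactly one weight per input")
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return 1 if weighted_sum > threshold else 0

# With these (made-up) weights and threshold the unit fires only
# when both inputs are on.
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, threshold_unit(x, weights=(1.0, 1.0), threshold=1.5))
```

Learning then amounts to adjusting the weights so that the unit's outputs agree with the known classes of the training examples.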

Decision tree learning

Class knowledge:
Group 1: Nordic countries
Group 2: UK, France, Greece, Spain, Portugal
Group 3: Benelux countries, Switzerland, Austria, Italy, Germany

Christian Democrats > 16?
  Yes: Group 3
  No: Agrarians > 4?
    Yes: Group 1
    No: Group 2
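A learned decision tree is applied by answering its questions from the root down. The sketch below hard-codes the two-level tree from this slide; it shows only how such a tree classifies, not how it is learned. The function name and the example values are hypothetical, and the slide does not state the units of the two party attributes.

```python
def classify_country(christian_democrats, agrarians):
    """Walk the tree from the slide and return the group number."""
    if christian_democrats > 16:   # root test
        return 3
    if agrarians > 4:              # second test, reached only on "No"
        return 1
    return 2

print(classify_country(christian_democrats=20, agrarians=0))
print(classify_country(christian_democrats=5, agrarians=10))
print(classify_country(christian_democrats=5, agrarians=1))
```

Each path from the root to a leaf reads as one IF-THEN statement, which is what makes trees easy to inspect.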

Rule learning: Rough sets

Class knowledge:
Group 1: Nordic countries
Group 2: UK, France, Greece, Spain, Portugal
Group 3: Benelux countries, Switzerland, Austria, Italy, Germany

Agrarians([4, *)) AND Christian Democrats([*, 16)) => Class(1)
Agrarians([*, 4)) AND Christian Democrats([*, 16)) => Class(2)
Christian Democrats([16, *)) => Class(3)
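The three rules on this slide can be written down as data and matched against an example. In this sketch each condition is read as a half-open interval [low, high), and `*` is interpreted as unbounded; both readings are assumptions about the notation. The names `RULES`, `matches` and `classify` are hypothetical.

```python
import math

# Each rule: ({attribute: (low, high)}, class), intervals read as [low, high).
RULES = [
    ({"agrarians": (4, math.inf), "christian_democrats": (-math.inf, 16)}, 1),
    ({"agrarians": (-math.inf, 4), "christian_democrats": (-math.inf, 16)}, 2),
    ({"christian_democrats": (16, math.inf)}, 3),
]

def matches(conditions, example):
    """True if every attribute of the example lies in its required interval."""
    return all(low <= example[attr] < high
               for attr, (low, high) in conditions.items())

def classify(example):
    """Return the class of the first matching rule, or None if none matches."""
    for conditions, label in RULES:
        if matches(conditions, example):
            return label
    return None

print(classify({"agrarians": 10, "christian_democrats": 5}))
print(classify({"agrarians": 1, "christian_democrats": 5}))
print(classify({"agrarians": 0, "christian_democrats": 20}))
```

Keeping the rules as plain data makes them easy to print, count and compare with existing knowledge about the classes.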

Supervised vs. clustering

Clustering
+ class discovery
+ robust towards incorrect knowledge

Supervised
+ evaluation
+ predictive/descriptive model
+ based on actual knowledge rather than idealized hypotheses