MACHINE LEARNING TECHNIQUES IN BIO-INFORMATICS Elena Marchiori IBIVU Vrije Universiteit Amsterdam.
MACHINE LEARNING TECHNIQUES IN BIOINFORMATICS
Elena Marchiori
IBIVU
Vrije Universiteit Amsterdam
Summary
• Machine Learning
• Supervised Learning: classification
• Unsupervised Learning: clustering
Machine Learning (ML)
• Construct a computational model from a dataset describing properties of an unknown (but existent) system.
[Diagram: observations of an unknown system and its properties are fed to ML, which constructs a computational model of the system.]
Supervised Learning
• The dataset describes examples of the input-output behaviour of an unknown (but existent) system.
• The algorithm tries to find a function ‘equivalent’ to the system.
• ML techniques for classification: K-nearest neighbour, decision trees, Naïve Bayes, Support Vector Machines.
Supervised Learning
[Diagram: a supervisor labels observations of the unknown system with the property of interest; the resulting training data feed the ML algorithm, which outputs a model used to predict the property for new observations.]
Example: A Classification Problem
• Categorize images of fish—say, “Atlantic salmon” vs. “Pacific salmon”
• Use features such as length, width, lightness, fin shape & number, mouth position, etc.
• Steps:
  1. Preprocessing (e.g., background subtraction)
  2. Feature extraction
  3. Classification
example from Duda & Hart
Classification in Bioinformatics
• Computational diagnostic: early cancer detection
• Tumor biomarker discovery
• Protein folding prediction
• Protein-protein binding sites prediction
From: Introduction to Hierarchical Clustering Analysis, Pengyu Hong
Classification Techniques
• Naïve Bayes
• K Nearest Neighbour
• Support Vector Machines (next lesson)
• …
Bayesian Approach
• Each observed training example can incrementally decrease or increase the estimated probability of a hypothesis, rather than eliminating the hypothesis outright
• Prior knowledge can be combined with observed data to determine hypothesis
• Bayesian methods can accommodate hypotheses that make probabilistic predictions
• New instances can be classified by combining the predictions of multiple hypotheses, weighted by their probabilities
Kathleen McKeown’s slides
Bayesian Approach
• Assign the most probable target value, given attribute values <a1, a2, …, an>:

    v_MAP = argmax_{vj ∈ V} P(vj | a1, a2, …, an)

• Using Bayes' theorem:

    v_MAP = argmax_{vj ∈ V} P(a1, a2, …, an | vj) P(vj) / P(a1, a2, …, an)
          = argmax_{vj ∈ V} P(a1, a2, …, an | vj) P(vj)

• Bayesian learning is optimal
• It is easy to estimate P(vj) by counting in the training data
• Estimating the different P(a1, a2, …, an | vj) is not feasible (we would need a training set of size proportional to the number of possible instances times the number of classes)
Bayes’ Rules
• Product rule: P(a ∧ b) = P(a|b) P(b) = P(b|a) P(a)
• Bayes' rule: P(a|b) = P(b|a) P(a) / P(b)
• In distribution form: P(Y|X) = P(X|Y) P(Y) / P(X) = α P(X|Y) P(Y)
Naïve Bayes
• Assume (conditional) independence of the attributes:

    P(a1, a2, …, an | vj) = ∏i P(ai | vj)

• Substitute into the v_MAP formula:

    v_NB = argmax_{vj ∈ V} P(vj) ∏i P(ai | vj)
v_NB = argmax_{vj ∈ V} P(vj) ∏i P(ai | vj)

 #  S-length  S-width  P-length  Class
 1  high      high     high      Versicolour
 2  low       high     low       Setosa
 3  low       high     low       Virginica
 4  low       high     med       Virginica
 5  high      high     high      Versicolour
 6  high      high     med       Setosa
 7  high      high     low       Setosa
 8  high      high     high      Versicolour
 9  high      high     high      Versicolour
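As an illustration (not part of the original slides), the v_NB rule can be applied to the table above with plain counting; the function names and data encoding below are my own:

```python
from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Estimate P(v) and P(a_i = value | v) by counting in the training set."""
    class_counts = Counter(labels)
    cond = defaultdict(Counter)  # (attribute index, class) -> value counts
    for row, v in zip(rows, labels):
        for i, a in enumerate(row):
            cond[(i, v)][a] += 1
    n = len(labels)
    priors = {v: c / n for v, c in class_counts.items()}
    return priors, cond, class_counts

def predict_nb(x, priors, cond, class_counts):
    """v_NB = argmax_v P(v) * prod_i P(a_i | v)."""
    best_v, best_p = None, -1.0
    for v, prior in priors.items():
        p = prior
        for i, a in enumerate(x):
            p *= cond[(i, v)][a] / class_counts[v]
        if p > best_p:
            best_v, best_p = v, p
    return best_v

# The slide's table: (S-length, S-width, P-length) -> class.
rows = [("high","high","high"), ("low","high","low"), ("low","high","low"),
        ("low","high","med"),  ("high","high","high"), ("high","high","med"),
        ("high","high","low"), ("high","high","high"), ("high","high","high")]
labels = ["Versicolour","Setosa","Virginica","Virginica","Versicolour",
          "Setosa","Setosa","Versicolour","Versicolour"]
model = train_nb(rows, labels)
print(predict_nb(("high","high","high"), *model))  # Versicolour
```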
Estimating Probabilities
• What happens when the number of data elements is small?
• Suppose the true P(S-length=low | Virginica) = .05
• There are only 2 instances with Class = Virginica
• We estimate probabilities by nc/n using the training set
• Then the count of S-length=low among the Virginica instances will likely be 0
• Then, instead of .05, we would use an estimated probability of 0
• Two problems:
  – Biased underestimate of the probability
  – This zero probability term will dominate (the whole product becomes 0) if a future query contains S-length=low
Instead: use the m-estimate
• Use priors as well:

    (nc + m·p) / (n + m)

  – where p = a prior estimate of P(S-length=low | Virginica)
  – m is a constant called the equivalent sample size
    » it determines how heavily to weight p relative to the observed data
    » typical method: assume a uniform prior over the attribute values (e.g. if the values are low, med, high, then p = 1/3)
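A minimal sketch of the m-estimate on the slide's hypothetical numbers (the function name is mine):

```python
def m_estimate(nc, n, p, m):
    """m-estimate of probability: (nc + m*p) / (n + m)."""
    return (nc + m * p) / (n + m)

# Slide example: n = 2 Virginica instances, none with S-length = low (nc = 0),
# uniform prior p = 1/3 over {low, med, high}; m = 3 is an illustrative choice.
print(m_estimate(0, 2, 1/3, 3))  # 0.2 instead of the biased estimate 0
```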
K-Nearest Neighbour
• Memorize the training data
• Given a new example, find its k nearest neighbours, and output the majority-vote class.
• Choices:
  – How many neighbours?
  – What distance measure?
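A minimal k-NN sketch under one set of such choices (Euclidean distance, majority vote); the data and names are illustrative:

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Majority vote among the k training examples nearest to the query."""
    neighbours = sorted(train, key=lambda xy: math.dist(xy[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Toy training set: (feature vector, class label).
train = [((1.0, 1.0), "A"), ((1.2, 0.9), "A"),
         ((5.0, 5.0), "B"), ((5.1, 4.8), "B")]
print(knn_predict(train, (1.1, 1.0), k=3))  # A
```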
Application in Bioinformatics
• A regression-based K nearest neighbor algorithm for gene function prediction from heterogeneous data, Z. Yao and W.L. Ruzzo, BMC Bioinformatics 2006, 7.
1. For each dataset k and each pair of genes p, compute the similarity f(p,k) of p w.r.t. the k-th dataset.
2. Construct a predictor H of gene-pair functional similarity, e.g. by logistic regression: (f(p,1), …, f(p,m)) ↦ H(f(p,1), …, f(p,m)), such that H takes a high value if the genes of p have similar functions.
3. Given a new gene g, find its k nearest neighbours using H as the distance. Predict the functional classes C1, …, Cn of g with confidence Confidence(Ci) = 1 − Π (1 − Pij), taken over neighbours gj of g with Ci among the classes of gj (the probability that at least one prediction is correct, i.e. 1 minus the probability that all predictions are wrong).
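The confidence formula can be sketched directly (the Pij values below are made up for illustration):

```python
def confidence(probs):
    """Confidence(Ci) = 1 - prod_j (1 - Pij): the probability that at least
    one neighbour's prediction of class Ci is correct."""
    prod = 1.0
    for p in probs:
        prod *= (1.0 - p)
    return 1.0 - prod

# Two neighbours each predicting Ci with probability 0.5:
print(confidence([0.5, 0.5]))  # 0.75
```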
Classification: CV error
• Training error: empirical error
• Error on an independent test set: test error
• Cross-validation (CV) error:
  – Leave-one-out (LOO)
  – N-fold CV

[Diagram: the N samples are split; in each round one part is held out for testing and the rest is used for training; the errors are counted and summarized as the CV error rate.]
Supervised learning
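A leave-one-out CV sketch, here wrapped around a 1-nearest-neighbour classifier (the classifier choice, names, and data are illustrative):

```python
import math

def loo_cv_error(data, classify):
    """Leave-one-out CV: hold out each sample once, count misclassifications."""
    errors = 0
    for i, (x, y) in enumerate(data):
        train = data[:i] + data[i + 1:]   # all samples except the held-out one
        if classify(train, x) != y:
            errors += 1
    return errors / len(data)

def one_nn(train, x):
    """Predict the label of the nearest training example."""
    return min(train, key=lambda ty: math.dist(ty[0], x))[1]

data = [((0.0,), "A"), ((0.1,), "A"), ((0.2,), "A"),
        ((1.0,), "B"), ((1.1,), "B")]
print(loo_cv_error(data, one_nn))  # 0.0 on this well-separated toy set
```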
Two schemes of cross validation
[Diagram: CV1: within each LOO iteration on the N samples, both the gene selector and the classifier are trained and tested; errors are counted. CV2: gene selection is performed once on all N samples before LOO; then only the classifier is trained and tested within LOO; errors are counted.]
Difference between CV1 and CV2
• CV1: gene selection performed within LOOCV
• CV2: gene selection performed before LOOCV
• CV2 can yield an optimistic estimate of the true classification error
• CV2 was used in the paper by Golub et al.:
  – 0 training errors
  – 2 CV errors (5.26%)
  – 5 test errors (14.7%)
  – CV error different from test error!
Significance of classification results
• Permutation test:
  – Permute the class labels of the samples
  – Compute the LOOCV error on the data with permuted labels
  – Repeat the process a large number of times
  – Compare with the LOOCV error on the original data:
• P-value = (# times LOOCV error on permuted data <= LOOCV error on original data) / (total # of permutations considered)
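The permutation test can be sketched as follows, using the LOOCV error of a 1-NN classifier as the statistic (the slides do not fix the classifier, so this choice and all names are illustrative):

```python
import math
import random

def loo_error(data):
    """LOOCV error of a 1-NN classifier."""
    errors = 0
    for i, (x, y) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        pred = min(rest, key=lambda ty: math.dist(ty[0], x))[1]
        errors += pred != y
    return errors / len(data)

def permutation_p_value(data, n_perm=1000, seed=0):
    """P-value = fraction of label permutations whose LOOCV error is
    at most the LOOCV error on the original labels."""
    rng = random.Random(seed)
    observed = loo_error(data)
    labels = [y for _, y in data]
    count = 0
    for _ in range(n_perm):
        rng.shuffle(labels)
        permuted = [(x, y) for (x, _), y in zip(data, labels)]
        if loo_error(permuted) <= observed:
            count += 1
    return count / n_perm

data = [((0.0,), "A"), ((0.1,), "A"), ((0.2,), "A"),
        ((1.0,), "B"), ((1.1,), "B"), ((1.2,), "B")]
print(permutation_p_value(data, n_perm=200))  # small: the structure is real
```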
Unsupervised Learning
ML for unsupervised learning attempts to discover interesting structure in the available data
Unsupervised learning
Unsupervised Learning
• The dataset describes the structure of an unknown (but existent) system.
• The computer program tries to identify structure of the system (clustering, data compression).
• ML techniques: hierarchical clustering, k-means, Self Organizing Maps (SOM), fuzzy clustering (described in a future lesson).
Clustering
Clustering is one of the most important unsupervised learning processes for organizing objects into groups whose members are similar in some way.
Clustering finds structures in a collection of unlabeled data.
A cluster is a collection of objects which are similar to one another and dissimilar to the objects belonging to other clusters.
Clustering Algorithms
• Start with a collection of n objects each represented by a p–dimensional feature vector xi , i=1, …n.
• The goal is to associate the n objects with k clusters so that objects “within” a cluster are more “similar” to each other than objects between clusters. k is usually unknown.
• Popular methods: hierarchical, k-means, SOM, …
Hierarchical Clustering
[Figure: dendrogram and Venn diagram of clustered data.]
From http://www.stat.unc.edu/postscript/papers/marron/Stat321FDA/RimaIzempresentation.ppt
Hierarchical Clustering (Cont.)
• Multilevel clustering: level 1 has n clusters; level n has one cluster.
• Agglomerative HC: starts with singletons and merges clusters.
• Divisive HC: starts with a single cluster containing all samples and splits clusters.
Nearest Neighbor Algorithm
• Nearest Neighbor Algorithm is an agglomerative approach (bottom-up).
• Starts with n nodes (n is the size of our sample), merges the 2 most similar nodes at each step, and stops when the desired number of clusters is reached.
[Figures: nearest-neighbour agglomeration shown level by level, from level 2 (k = 7 clusters) through level 8 (k = 1 cluster).]
Hierarchical Clustering

Keys: the similarity measure and the clustering (linkage) rule.

1. Calculate the similarity between all possible combinations of two profiles.
2. Group the two most similar clusters together to form a new cluster.
3. Calculate the similarity between the new cluster and all remaining clusters, and repeat.
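The loop above can be sketched as follows, assuming Euclidean distance and average linkage (one of several possible choices; names and data are illustrative):

```python
import math

def average_linkage(c1, c2):
    """Mean pairwise Euclidean distance between members of two clusters."""
    return sum(math.dist(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

def agglomerative(points, k):
    """Repeatedly merge the two closest clusters until k clusters remain."""
    clusters = [[p] for p in points]          # start with singletons
    while len(clusters) > k:
        # find the most similar (closest) pair under the chosen linkage
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: average_linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] += clusters.pop(j)        # merge them into one cluster
    return clusters

pts = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
print(agglomerative(pts, 3))
```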
Clustering in Bioinformatics
• Microarray data quality checking:
  – Do replicates cluster together?
  – Do similar conditions, time points, and tissue types cluster together?
• Cluster genes: prediction of the functions of unknown genes from known ones
• Cluster samples: discover clinical characteristics (e.g. survival, marker status) shared by samples
• Promoter analysis of commonly regulated genes
Functionally significant gene clusters

[Figure: two-way clustering of an expression matrix produces gene clusters (rows) and sample clusters (columns).]
Bhattacharjee et al. (2001) Human lung carcinomas mRNA expression profiling reveals distinct adenocarcinoma subclasses.
Proc. Natl. Acad. Sci. USA, Vol. 98, 13790-13795.
Similarity Measurements
• Pearson Correlation

For two profiles (vectors) x = (x1, …, xN) and y = (y1, …, yN):

    C_pearson(x, y) = Σ_{i=1..N} (xi − mx)(yi − my) / sqrt( [Σ_{i=1..N} (xi − mx)²] · [Σ_{i=1..N} (yi − my)²] )

where mx = (1/N) Σ_{n=1..N} xn and my = (1/N) Σ_{n=1..N} yn.

The Pearson correlation ranges from −1 to +1.
Similarity Measurements
• Pearson Correlation: Trend Similarity

[Figure: three profiles a, b, c following the same trend (the original figure relates them via the constants 0.5 and 0.2); all pairwise Pearson correlations equal 1: C_pearson(a,b) = C_pearson(a,c) = C_pearson(b,c) = 1.]
Similarity Measurements
• Euclidean Distance

For two profiles x = (x1, …, xN) and y = (y1, …, yN):

    d(x, y) = sqrt( Σ_{n=1..N} (xn − yn)² )
Similarity Measurements
• Euclidean Distance: Absolute Difference

[Figure: the same three profiles a, b, c; although their trends are identical, their Euclidean distances are nonzero: d(a,c) = 1.5875, d(a,b) = 2.8025, d(b,c) = 3.2211.]
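Both measures can be computed directly; the example profiles below are illustrative, not the ones from the figures:

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) *
                    sum((b - my) ** 2 for b in y))
    return num / den

def euclidean(x, y):
    """Euclidean distance between two equal-length profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

a = [1.0, 2.0, 3.0, 4.0]
b = [0.5, 1.0, 1.5, 2.0]   # same trend as a, different scale
print(pearson(a, b))        # ~1.0: identical trend
print(euclidean(a, b))      # nonzero: absolute levels differ
```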
Clustering

[Figure: three clusters C1, C2, C3. Which pair of clusters should be merged?]
Clustering: Single Linkage

[Figure: two clusters C1 and C2; the linking distance is between their closest members.]

Dissimilarity between two clusters = minimum dissimilarity between the members of the two clusters.

Tends to generate “long chains”.
Clustering: Complete Linkage

[Figure: two clusters C1 and C2; the linking distance is between their farthest members.]

Dissimilarity between two clusters = maximum dissimilarity between the members of the two clusters.

Tends to generate “clumps”.
Clustering: Average Linkage

[Figure: two clusters C1 and C2; all pairwise distances contribute.]

Dissimilarity between two clusters = average of the distances over all pairs of objects (one from each cluster).
Clustering: Average Group Linkage

[Figure: two clusters C1 and C2 with their means marked by +.]

Dissimilarity between two clusters = distance between the two cluster means.
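The four linkage rules can be sketched as functions of two clusters (Euclidean distance assumed; the example clusters are illustrative):

```python
import math

def single_linkage(c1, c2):
    """Minimum distance between members of the two clusters."""
    return min(math.dist(a, b) for a in c1 for b in c2)

def complete_linkage(c1, c2):
    """Maximum distance between members of the two clusters."""
    return max(math.dist(a, b) for a in c1 for b in c2)

def average_linkage(c1, c2):
    """Average distance over all pairs (one member from each cluster)."""
    return sum(math.dist(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

def mean(c):
    """Component-wise mean of a cluster."""
    return tuple(sum(v) / len(c) for v in zip(*c))

def average_group_linkage(c1, c2):
    """Distance between the two cluster means."""
    return math.dist(mean(c1), mean(c2))

c1 = [(0.0, 0.0), (0.0, 2.0)]
c2 = [(3.0, 0.0), (5.0, 0.0)]
print(single_linkage(c1, c2))         # 3.0
print(complete_linkage(c1, c2))       # sqrt(29)
print(average_linkage(c1, c2))
print(average_group_linkage(c1, c2))  # sqrt(17): distance (0,1)-(4,0)
```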
Considerations
• What genes are used to cluster samples?
  – Expression variation
  – Inherent variation
  – Prior knowledge (irrelevant genes)
  – Etc.
K-means Clustering
– Initialize the K cluster representatives w_j, e.g. to randomly chosen examples.
– Assign each input example x to the cluster c(x) with the nearest corresponding weight vector:

    c(x) = argmin_j || x − w_j(n) ||

– Update the weights: each representative becomes the mean of the examples assigned to it,

    w_j(n+1) = (1/n_j) Σ_{x : c(x) = j} x,   with n_j the number of examples assigned to cluster j

– Increment n by 1 and iterate until no noticeable changes of the cluster representatives occur.
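A batch k-means sketch following these steps (initialization to randomly chosen examples; all names and data are illustrative):

```python
import math
import random

def kmeans(points, k, iters=100, seed=0):
    rng = random.Random(seed)
    reps = rng.sample(points, k)  # initialize to randomly chosen examples
    for _ in range(iters):
        # assignment step: each point goes to its nearest representative
        clusters = [[] for _ in range(k)]
        for x in points:
            j = min(range(k), key=lambda j: math.dist(x, reps[j]))
            clusters[j].append(x)
        # update step: representative = mean of its assigned points
        new_reps = [tuple(sum(v) / len(c) for v in zip(*c)) if c else reps[j]
                    for j, c in enumerate(clusters)]
        if new_reps == reps:   # no noticeable change: converged
            break
        reps = new_reps
    return reps, clusters

pts = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9)]
reps, clusters = kmeans(pts, 2)
print(sorted(reps))
```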
Example I

[Figure: initial data and seeds (left); final clustering (right).]
Example II

[Figure: initial data and seeds (left); final clustering (right).]
The brain maps the external multidimensional representation of the world into a similar 1- or 2-dimensional internal representation.
That is, the brain processes external signals in a topology-preserving way.
Mimicking the way the brain learns, our clustering system should be able to do the same thing.
SOM: Brain’s self-organization
Self-Organizing Map: idea

Data: vectors x = (x1, …, xd) from a d-dimensional space.
Grid of nodes, with a local processor (called a neuron) in each node.
Local processor #j has d adaptive parameters W(j).
Goal: change the W(j) parameters to recover the data clusters in the x space.
Training process

[Figure (labels translated from Polish): x = data points (dane); o = positions of the neuron weights (pozycje wag neuronów); the data space (przestrzeń danych) is N-dimensional, while the grid of neurons (siatka neuronów) is 2-D; the weights point to points in N-D.]

Java demos: http://www.neuroinformatik.ruhr-uni-bochum.de/ini/VDM/research/gsn/DemoGNG/GNG.html
Concept of the SOM

[Figure: input space vs. reduced feature space. The cluster centers (code vectors) live in the input space; their places in the reduced feature space form a two-dimensional grid: clustering and ordering of the cluster centers.]
Concept of the SOM

[Figure: a trained map with cells labelled by samples/elements (Ba, Mn, Sr, Mg, SA3, …). We can use it for visualization, for classification, and for clustering.]
SOM: learning algorithm

• Initialization: n = 0. Choose random small values for the weight-vector components.
• Sampling: select an x from the input examples.
• Similarity matching: find the winning neuron i(x) at iteration n:

    i(x) = argmin_j || x(n) − w_j(n) ||

• Updating: adjust the weight vectors of all neurons using the rule:

    w_j(n+1) = w_j(n) + η(n) h_{j,i(x)}(d_{ji}) [ x(n) − w_j(n) ]

• Continuation: n = n + 1. Go to the Sampling step until no noticeable changes in the weights are observed.
Neighborhood Function

– Gaussian neighborhood function:

    h_{ji}(d_{ji}) = exp( − d_{ji}² / (2σ²) )

– d_{ji}: lateral distance of neurons i and j
  • in a 1-dimensional lattice: | j − i |
  • in a 2-dimensional lattice: || r_j − r_i ||, where r_j is the position of neuron j in the lattice.
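A 1-D SOM training sketch combining the learning algorithm and the Gaussian neighbourhood above (the decay schedules for η and σ, and all names, are illustrative assumptions):

```python
import math
import random

def train_som(data, n_nodes=10, iters=500, eta0=0.5, sigma0=3.0, seed=0):
    """Train a 1-D SOM: the winner and its lattice neighbours are pulled
    toward each sampled input, with decaying rate and neighbourhood width."""
    rng = random.Random(seed)
    dim = len(data[0])
    # initialization: random small values for the weight components
    w = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(n_nodes)]
    for n in range(iters):
        x = rng.choice(data)                       # sampling
        # similarity matching: winner i(x) minimizes ||x - w_j||
        i = min(range(n_nodes), key=lambda j: math.dist(x, w[j]))
        eta = eta0 * (1 - n / iters)               # decaying learning rate
        sigma = sigma0 * (1 - n / iters) + 0.5     # decaying neighbourhood width
        for j in range(n_nodes):
            # Gaussian neighbourhood over the lattice distance |j - i|
            h = math.exp(-((j - i) ** 2) / (2 * sigma ** 2))
            w[j] = [wj + eta * h * (xk - wj) for wj, xk in zip(w[j], x)]
    return w

data = [(0.0,), (0.1,), (1.0,), (1.1,)]
weights = train_som(data)
```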
Initial h function

[Figure: example of the neighborhood function at the start of training.]
Some examples of real-life applications
The Helsinki University of Technology web site http://www.cis.hut.fi/research/refs/ contains > 5000 papers on SOM and its applications:
• Brain research: modeling the formation of various topographical maps in motor, auditory, visual and somatotopic areas.
• Clustering of genes, protein properties, chemical compounds, speech phonemes, sounds of birds and insects, astronomical objects, economic data, business and financial data, ...
• Data compression (images and audio), information filtering.
• Medical and technical diagnostics.
Issues in Clustering
• How many clusters?
  – User parameter
  – Use model selection criteria (e.g. the Bayesian Information Criterion) with a penalty term for model complexity. See e.g. X-means: http://www2.cs.cmu.edu/~dpelleg/kmeans.html
• What similarity measure?
  – Euclidean distance
  – Correlation coefficient
  – Ad-hoc similarity measures
Validation of clustering results
• External measures– According to some external knowledge
– Consideration of bias and subjectivity
• Internal measures– Quality of clusters according to the data
– Compactness and separation
– Stability
– …
See e.g. J. Handl, J. Knowles, D.B. Kell, Computational cluster validation in post-genomic data analysis, Bioinformatics, 21(15):3201-3212, 2005.
Molecular Classification of Cancer:Class Discovery and Class Prediction by Gene Expression Monitoring
T.R. Golub et al., Science 286, 531 (1999)
Bioinformatics Application
Identification of cancer types
• Why is identification of the cancer class (tumor sub-type) important?
  – Cancers of identical grade can have widely variable clinical courses (e.g. acute lymphoblastic leukemia vs. acute myeloid leukemia).
• Traditional methods:
  – Morphological appearance.
  – Enzyme-based histochemical analyses.
  – Immunophenotyping.
  – Cytogenetic analysis.
Golub et al. 1999
Class Prediction
• How could one use an initial collection of samples belonging to known classes to create a class predictor?
  – Identification of informative genes
  – Weighted voting
Golub et al slides
Data
• Initial sample: 38 bone marrow samples (27 ALL, 11 AML) obtained at the time of diagnosis.
• Independent sample: 34 leukemia samples, consisting of 24 bone marrow and 10 peripheral blood samples (20 ALL and 14 AML).
Validation of Gene Voting
• Initial samples: 36 of the 38 samples were predicted as either AML or ALL, and two as uncertain. All 36 predictions agreed with the clinical diagnosis.
• Independent samples: 29 of the 34 samples were strongly predicted, with 100% accuracy.
Class Discovery
• Can cancer classes be discovered automatically based on gene expression?
  – Cluster tumors by gene expression
  – Determine whether the putative classes produced are meaningful.
Cluster tumors
• Self-Organizing Map (SOM): mathematical cluster analysis for recognizing and classifying features in complex, multidimensional data (similar to the K-means approach).
• Chooses a geometry of “nodes”.
• Nodes are mapped into K-dimensional space, initially at random.
• Iteratively adjusts the nodes.
Validation of SOM
• Prediction based on clusters A1 and A2:
  – 24/25 of the ALL samples from the initial dataset were clustered in group A1
  – 10/13 of the AML samples from the initial dataset were clustered in group A2
Validation of SOM
• How could one evaluate the putative clusters if the “right” answer were not known?
  – Assumption: class discovery can be tested by class prediction.
• Testing the assumption:
  – Construct predictors based on clusters A1 and A2.
  – Construct predictors based on random clusters.
Validation of SOM
• Predictions using predictors based on clusters A1 and A2 yield 34 accurate predictions, one error, and three uncertain calls.
Validation of SOM

[Figure: summary of the SOM validation results.]
CONCLUSION
• In Machine Learning, every technique has its assumptions and constraints, advantages and limitations
• My view:
  – First perform simple data analysis before applying fancy high-tech ML methods
  – Possibly use different ML techniques and then ensemble the results
  – Apply the correct cross-validation method!
  – Check the significance of the results (permutation test, stability of selected genes)
  – Work in collaboration with the data producer (biologist, pathologist) when possible!
ML in bioinformatics