BioInformatics (3)
![Page 1: BioInformatics (3)](https://reader035.fdocuments.in/reader035/viewer/2022062500/56814fb9550346895dbd71b8/html5/thumbnails/1.jpg)
BioInformatics (3)
![Page 2: BioInformatics (3)](https://reader035.fdocuments.in/reader035/viewer/2022062500/56814fb9550346895dbd71b8/html5/thumbnails/2.jpg)
Computational Issues
• Data Warehousing:
– Organising Biological Information into a Structured Entity (World’s Largest Distributed DB)
• Function Analysis (Numerical Analysis):
– Gene Expression Analysis: Applying sophisticated data mining/visualisation to understand gene activities within an environment (Clustering)
– Integrated Genomic Study: Relating structural analysis with functional analysis
• Structure Analysis (Symbolic Analysis):
– Sequence Alignment: Analysing a sequence using comparative methods against existing databases to develop hypotheses concerning relatives (genetics) and functions (Dynamic Programming and HMM)
– Structure prediction: predicting the 3D structure of a protein from its sequence (Inductive LP)
![Page 3: BioInformatics (3)](https://reader035.fdocuments.in/reader035/viewer/2022062500/56814fb9550346895dbd71b8/html5/thumbnails/3.jpg)
Data Warehousing : Mapping Biologic into Data Logic
![Page 4: BioInformatics (3)](https://reader035.fdocuments.in/reader035/viewer/2022062500/56814fb9550346895dbd71b8/html5/thumbnails/4.jpg)
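Structure Analysis: Alignments & Scores
• Global (e.g. haplotype): ACCACACA aligned with ACACCATA, column pattern ::xx::x: (":" = match, "x" = mismatch). Score = 5(+1) + 3(-1) = 2
• Suffix (shotgun assembly): ACCACACA against ACACCATA, 3 matching columns (:::). Score = 3(+1) = 3
• Local (motif): ACCACACA against ACACCATA, 4 matching columns (::::). Score = 4(+1) = 4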
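The three scores above can be checked with a short sketch. It assumes the +1/-1 match/mismatch scoring shown on the slide and no gaps. The function names are hypothetical, and the local case is simplified to the longest exact common substring rather than a full dynamic-programming local alignment.

```python
def ungapped_score(top, bottom, match=1, mismatch=-1):
    """Score two equal-length, already aligned strings column by column."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have the same length")
    return sum(match if a == b else mismatch for a, b in zip(top, bottom))


def suffix_prefix_overlap(left, right):
    """Longest suffix of `left` that is also a prefix of `right`."""
    for k in range(min(len(left), len(right)), 0, -1):
        if left[-k:] == right[:k]:
            return left[-k:]
    return ""


def longest_common_substring(a, b):
    """Longest exact substring shared by a and b (simplified local match)."""
    best = ""
    for i in range(len(a)):
        # Only try substrings longer than the best one found so far.
        for j in range(i + len(best) + 1, len(a) + 1):
            if a[i:j] in b:
                best = a[i:j]
            else:
                break
    return best


x, y = "ACCACACA", "ACACCATA"
global_score = ungapped_score(x, y)                     # full-length comparison
suffix_score = len(suffix_prefix_overlap(x, y))         # +1 per matching column
local_score = len(longest_common_substring(x, y))       # +1 per matching column
print(global_score, suffix_score, local_score)
```

Real alignment tools also allow gaps, which is where dynamic programming comes in; this sketch only reproduces the ungapped arithmetic of the slide.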
![Page 5: BioInformatics (3)](https://reader035.fdocuments.in/reader035/viewer/2022062500/56814fb9550346895dbd71b8/html5/thumbnails/5.jpg)
A comparison of the homology search and the motif search for functional interpretation of sequence information.
[Figure: two flow diagrams side by side.]
• Homology Search: New sequence → Retrieval from the Sequence database (Primary data) → Similar sequence → Expert knowledge → Sequence interpretation
• Motif Search: Sequence database (Primary data) → Knowledge acquisition → Motif library (Empirical rules) + Expert knowledge; New sequence → Inference → Sequence interpretation
![Page 6: BioInformatics (3)](https://reader035.fdocuments.in/reader035/viewer/2022062500/56814fb9550346895dbd71b8/html5/thumbnails/6.jpg)
Search and learning problems in sequence analysis
(Problems in Biological Science, each followed by the corresponding Math/Stat/CompSci method)
• Similarity search: Pairwise sequence alignment; Database search for similar sequences; Multiple sequence alignment; Phylogenetic tree reconstruction; Protein 3D structure alignment
– Method: Optimization algorithms: Dynamic programming (DP); Simulated annealing (SA); Genetic algorithms (GA); Markov Chain Monte Carlo (MCMC: Metropolis and Gibbs samplers); Hopfield neural network
• Structure/function prediction
– ab initio prediction: RNA secondary structure prediction; RNA 3D structure prediction; Protein 3D structure prediction
– Knowledge based prediction: Motif extraction; Functional site prediction; Cellular localization prediction; Coding region prediction; Transmembrane domain prediction; Protein secondary structure prediction; Protein 3D structure prediction
– Method: Pattern recognition and learning algorithms: Discriminant analysis; Neural networks; Support vector machines; Hidden Markov models (HMM); Formal grammar; CART
• Molecular classification: Superfamily classification; Ortholog/paralog grouping of genes; 3D fold classification
– Method: Clustering algorithms: Hierarchical, k-means, etc; PCA, MDS, etc; Self-organizing maps, etc
![Page 7: BioInformatics (3)](https://reader035.fdocuments.in/reader035/viewer/2022062500/56814fb9550346895dbd71b8/html5/thumbnails/7.jpg)
(Whole genome) Gene Expression Analysis
• Quantitative Analysis of Gene Activities (Transcription Profiles)
Gene Expression
![Page 8: BioInformatics (3)](https://reader035.fdocuments.in/reader035/viewer/2022062500/56814fb9550346895dbd71b8/html5/thumbnails/8.jpg)
Biotinylated RNA from experiment
GeneChip expression analysis probe array
Image of hybridized probe array
Each probe cell contains millions of copies of a specific oligonucleotide probe
Streptavidin-phycoerythrin conjugate
![Page 9: BioInformatics (3)](https://reader035.fdocuments.in/reader035/viewer/2022062500/56814fb9550346895dbd71b8/html5/thumbnails/9.jpg)
(Sub)cellular inhomogeneity
(see figure)
Cell-cycle differences in expression.
XIST RNA localized on inactive X-chromosome
![Page 10: BioInformatics (3)](https://reader035.fdocuments.in/reader035/viewer/2022062500/56814fb9550346895dbd71b8/html5/thumbnails/10.jpg)
Cluster Analysis
General Purpose: To divide samples into homogeneous groups based on a set of features.
Gene Expression Analysis: To find co-regulated genes.
Protein/protein complex
Genes
DNA regulatory elements
![Page 11: BioInformatics (3)](https://reader035.fdocuments.in/reader035/viewer/2022062500/56814fb9550346895dbd71b8/html5/thumbnails/11.jpg)
Functional Analysis via Gene Expression
Gene Expression Data
Distance/Similarity Matrix
Gene Clusters
Regulatory Elements / Gene Functions
Pairwise Measures
Clustering
Motif Searching/...
![Page 12: BioInformatics (3)](https://reader035.fdocuments.in/reader035/viewer/2022062500/56814fb9550346895dbd71b8/html5/thumbnails/12.jpg)
Clustering Algorithms
A clustering algorithm attempts to find natural groups of components (or data) based on some similarity measure. It also finds the centroid of each group. To determine cluster membership, most algorithms evaluate the distance between a point and the cluster centroids. The output of a clustering algorithm is essentially a statistical description of the cluster centroids, together with the number of components in each cluster.
![Page 13: BioInformatics (3)](https://reader035.fdocuments.in/reader035/viewer/2022062500/56814fb9550346895dbd71b8/html5/thumbnails/13.jpg)
Clusters of Two-Dimensional Data
![Page 14: BioInformatics (3)](https://reader035.fdocuments.in/reader035/viewer/2022062500/56814fb9550346895dbd71b8/html5/thumbnails/14.jpg)
Key Terms in Cluster Analysis
• Distance & Similarity measures
• Hierarchical & non-hierarchical
• Single/complete/average linkage
• Dendrograms & ordering
![Page 15: BioInformatics (3)](https://reader035.fdocuments.in/reader035/viewer/2022062500/56814fb9550346895dbd71b8/html5/thumbnails/15.jpg)
Distance Measures: Minkowski Metric
Suppose two objects $x$ and $y$ both have $p$ features:

$$x = (x_1, x_2, \ldots, x_p), \qquad y = (y_1, y_2, \ldots, y_p)$$

The Minkowski metric is defined by

$$d(x, y) = \left( \sum_{i=1}^{p} |x_i - y_i|^r \right)^{1/r}$$
![Page 16: BioInformatics (3)](https://reader035.fdocuments.in/reader035/viewer/2022062500/56814fb9550346895dbd71b8/html5/thumbnails/16.jpg)
Most Common Minkowski Metrics
1. $r = 2$ (Euclidean distance):

$$d(x, y) = \sqrt{\sum_{i=1}^{p} |x_i - y_i|^2}$$

2. $r = 1$ (Manhattan distance):

$$d(x, y) = \sum_{i=1}^{p} |x_i - y_i|$$

3. $r = +\infty$ ("sup" distance):

$$d(x, y) = \max_{1 \le i \le p} |x_i - y_i|$$
![Page 17: BioInformatics (3)](https://reader035.fdocuments.in/reader035/viewer/2022062500/56814fb9550346895dbd71b8/html5/thumbnails/17.jpg)
An Example
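[Figure: two points x and y that differ by 4 along one axis and by 3 along the other.]

1. Euclidean distance: $\sqrt{4^2 + 3^2} = 5$.
2. Manhattan distance: $4 + 3 = 7$.
3. "sup" distance: $\max\{4, 3\} = 4$.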
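The Minkowski family can be sketched as one small function. The coordinates below are an assumption chosen to reproduce the 4-by-3 example (the slide only shows the two side lengths), and `minkowski` is a hypothetical helper name.

```python
import math


def minkowski(x, y, r):
    """Minkowski distance of order r; r = math.inf gives the "sup" distance."""
    if len(x) != len(y):
        raise ValueError("both objects must have the same number of features")
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(r):
        return max(diffs)
    return sum(d ** r for d in diffs) ** (1.0 / r)


# Assumed coordinates: the points differ by 4 in one feature and 3 in the other.
x, y = (0.0, 0.0), (4.0, 3.0)
euclidean = minkowski(x, y, 2)
manhattan = minkowski(x, y, 1)
sup = minkowski(x, y, math.inf)
print(euclidean, manhattan, sup)
```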
![Page 18: BioInformatics (3)](https://reader035.fdocuments.in/reader035/viewer/2022062500/56814fb9550346895dbd71b8/html5/thumbnails/18.jpg)
Manhattan distance is called Hamming distance when all features are binary.
Gene Expression Levels Under 17 Conditions (1-High, 0-Low)

Condition: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
GeneA: 0 1 1 0 0 1 0 0 1 0 0 1 1 1 0 0 1
GeneB: 0 1 1 1 0 0 0 0 1 1 1 1 1 1 0 1 1

Hamming Distance: #(01) + #(10) = 4 + 1 = 5.
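A minimal sketch of the Hamming count for the two 17-condition profiles follows. The bit strings are read in condition order 1 to 17, which is an assumption recovered from the extracted slide text; the total distance does not depend on that ordering.

```python
# Binary expression profiles (1 = high, 0 = low), conditions 1..17.
# The left-to-right ordering is an assumption; the distance is order-independent.
gene_a = "01100100100111001"
gene_b = "01110000111111011"


def hamming(u, v):
    """Number of positions at which two equal-length sequences differ."""
    if len(u) != len(v):
        raise ValueError("profiles must have the same length")
    return sum(a != b for a, b in zip(u, v))


# Split the mismatches by direction: (A, B) = (0, 1) versus (1, 0).
n01 = sum(a == "0" and b == "1" for a, b in zip(gene_a, gene_b))
n10 = sum(a == "1" and b == "0" for a, b in zip(gene_a, gene_b))
distance = hamming(gene_a, gene_b)
print(n01, n10, distance)
```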
![Page 19: BioInformatics (3)](https://reader035.fdocuments.in/reader035/viewer/2022062500/56814fb9550346895dbd71b8/html5/thumbnails/19.jpg)
Similarity Measures: Correlation Coefficient
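$$s(x, y) = \frac{\sum_{i=1}^{p} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{p} (x_i - \bar{x})^2 \times \sum_{i=1}^{p} (y_i - \bar{y})^2}}$$

averages: $\bar{x} = \frac{1}{p} \sum_{i=1}^{p} x_i$ and $\bar{y} = \frac{1}{p} \sum_{i=1}^{p} y_i$.

$$|s(x, y)| \le 1$$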
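The correlation coefficient can be computed directly from its definition. The sketch below uses made-up expression profiles (not data from the slides) to show that two genes with the same shape but different offsets or scales still get a similarity of 1, while mirrored profiles get -1.

```python
import math


def correlation(x, y):
    """Pearson correlation coefficient s(x, y) between two feature vectors."""
    if len(x) != len(y):
        raise ValueError("both objects must have the same number of features")
    p = len(x)
    mean_x = sum(x) / p
    mean_y = sum(y) / p
    numerator = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    denominator = math.sqrt(
        sum((a - mean_x) ** 2 for a in x) * sum((b - mean_y) ** 2 for b in y)
    )
    if denominator == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return numerator / denominator


# Hypothetical time-course profiles, for illustration only.
gene_a = [1.0, 2.0, 3.0, 4.0]
shifted = [11.0, 12.0, 13.0, 14.0]   # same shape, higher baseline
scaled = [2.0, 4.0, 6.0, 8.0]        # same shape, larger amplitude
mirrored = [4.0, 3.0, 2.0, 1.0]      # opposite trend
print(correlation(gene_a, shifted), correlation(gene_a, scaled),
      correlation(gene_a, mirrored))
```

This insensitivity to offset and scale is the main practical difference from the Minkowski distances, which would place the shifted profile far from the original.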
![Page 20: BioInformatics (3)](https://reader035.fdocuments.in/reader035/viewer/2022062500/56814fb9550346895dbd71b8/html5/thumbnails/20.jpg)
Similarity Measures: Correlation Coefficient
[Figure: three panels, each plotting Expression Level against Time for Gene A and Gene B.]
![Page 21: BioInformatics (3)](https://reader035.fdocuments.in/reader035/viewer/2022062500/56814fb9550346895dbd71b8/html5/thumbnails/21.jpg)
Distance-based Clustering
• Assign a distance measure between data
• Find a partition such that:
– Distance between objects within a partition (i.e. same cluster) is minimized
– Distance between objects from different clusters is maximised
• Issues:
– Requires defining a distance (similarity) measure in situations where it is unclear how to assign it
– What relative weighting to give to one attribute vs another?
– Number of possible partitions is super-exponential
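To give a feel for the last issue, the number of ways to split n objects into k nonempty clusters is the Stirling number of the second kind, and summing over k gives the Bell number. The sketch below uses the standard recurrence; the function names are hypothetical.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def stirling2(n, k):
    """Number of ways to partition n objects into exactly k nonempty clusters."""
    if n == k:
        return 1
    if k == 0 or k > n:
        return 0
    # Object n either joins one of the k existing clusters or starts a new one.
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)


def bell(n):
    """Total number of partitions of n objects into any number of clusters."""
    return sum(stirling2(n, k) for k in range(n + 1))


for n in (4, 10, 20):
    print(n, bell(n))
```

Even for a few dozen genes the count is far too large to enumerate, which is why practical methods rely on greedy or iterative heuristics.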
![Page 22: BioInformatics (3)](https://reader035.fdocuments.in/reader035/viewer/2022062500/56814fb9550346895dbd71b8/html5/thumbnails/22.jpg)
Normalized Expression Data
Hierarchical & non-hierarchical clustering
[Figure panels: a, b, c, d]
![Page 23: BioInformatics (3)](https://reader035.fdocuments.in/reader035/viewer/2022062500/56814fb9550346895dbd71b8/html5/thumbnails/23.jpg)
Hierarchical Clustering Techniques
At the beginning, each object (gene) is a cluster. In each of the subsequent steps, the two closest clusters merge into one cluster, until there is only one cluster left.
![Page 24: BioInformatics (3)](https://reader035.fdocuments.in/reader035/viewer/2022062500/56814fb9550346895dbd71b8/html5/thumbnails/24.jpg)
Hierarchical Clustering
Given a set of N items to be clustered, and an NxN distance (or similarity) matrix, the basic process of hierarchical clustering is this:
1. Start by assigning each item to its own cluster, so that if you have N items, you now have N clusters, each containing just one item. Let the distances (similarities) between the clusters equal the distances (similarities) between the items they contain.
2. Find the closest (most similar) pair of clusters and merge them into a single cluster, so that now you have one less cluster.
3. Compute distances (similarities) between the new cluster and each of the old clusters.
4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N.
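The four steps can be sketched in a few lines of Python. This is an illustrative sketch, not an efficient implementation: it recomputes cluster distances from the item distances on every pass instead of updating a matrix, and the `linkage` argument (`min` or `max`) is an assumption about how cluster-to-cluster distance is derived. The four-item distances are taken from the deck's a, b, c, d example.

```python
from itertools import combinations


def agglomerate(items, dist, linkage=min):
    """Merge the closest pair of clusters until one cluster is left.

    dist maps frozenset({a, b}) to the distance between items a and b.
    linkage combines the cross-cluster item distances (min or max).
    Returns the list of merges as (cluster_1, cluster_2, distance).
    """
    clusters = [frozenset([item]) for item in items]       # step 1

    def cluster_distance(c1, c2):                          # step 3, on demand
        return linkage(dist[frozenset((a, b))] for a in c1 for b in c2)

    merges = []
    while len(clusters) > 1:                               # step 4
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: cluster_distance(*pair))   # step 2
        merges.append((sorted(c1), sorted(c2), cluster_distance(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
    return merges


pairs = {("a", "b"): 2, ("a", "c"): 5, ("a", "d"): 6,
         ("b", "c"): 3, ("b", "d"): 5, ("c", "d"): 4}
dist = {frozenset(k): v for k, v in pairs.items()}

single = agglomerate("abcd", dist, linkage=min)
complete = agglomerate("abcd", dist, linkage=max)
print([h for _, _, h in single], [h for _, _, h in complete])
```

The merge distances returned here are the heights at which branches join in the corresponding dendrogram.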
![Page 25: BioInformatics (3)](https://reader035.fdocuments.in/reader035/viewer/2022062500/56814fb9550346895dbd71b8/html5/thumbnails/25.jpg)
The distance between two clusters is defined as the distance between
• Single-Link Method / Nearest Neighbor
• Complete-Link / Furthest Neighbor
• Their Centroids.
• Average of all cross-cluster pairs.
![Page 26: BioInformatics (3)](https://reader035.fdocuments.in/reader035/viewer/2022062500/56814fb9550346895dbd71b8/html5/thumbnails/26.jpg)
Computing Distances
• single-link clustering (also called the connectedness or minimum method): we consider the distance between one cluster and another cluster to be equal to the shortest distance from any member of one cluster to any member of the other cluster. If the data consist of similarities, we consider the similarity between one cluster and another cluster to be equal to the greatest similarity from any member of one cluster to any member of the other cluster.
• complete-link clustering (also called the diameter or maximum method): we consider the distance between one cluster and another cluster to be equal to the longest distance from any member of one cluster to any member of the other cluster.
• average-link clustering: we consider the distance between one cluster and another cluster to be equal to the average distance from any member of one cluster to any member of the other cluster.
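The three definitions differ only in how they summarize the cross-cluster pair distances. A small sketch, assuming the item distances of the deck's four-object example and two hypothetical clusters {a, b} and {c, d}:

```python
from statistics import mean

# Item-to-item distances (symmetric); values from the four-object example.
pairs = {("a", "b"): 2, ("a", "c"): 5, ("a", "d"): 6,
         ("b", "c"): 3, ("b", "d"): 5, ("c", "d"): 4}
dist = {frozenset(k): v for k, v in pairs.items()}


def cross_distances(c1, c2):
    """All distances from a member of c1 to a member of c2."""
    return [dist[frozenset((a, b))] for a in c1 for b in c2]


def single_link(c1, c2):
    return min(cross_distances(c1, c2))      # shortest cross-cluster distance


def complete_link(c1, c2):
    return max(cross_distances(c1, c2))      # longest cross-cluster distance


def average_link(c1, c2):
    return mean(cross_distances(c1, c2))     # mean over all cross-cluster pairs


left, right = {"a", "b"}, {"c", "d"}
print(single_link(left, right), complete_link(left, right),
      average_link(left, right))
```

For similarity data the roles of `min` and `max` swap, as the single-link bullet notes.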
![Page 27: BioInformatics (3)](https://reader035.fdocuments.in/reader035/viewer/2022062500/56814fb9550346895dbd71b8/html5/thumbnails/27.jpg)
Single-Link Method
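Distance Matrix (Euclidean Distance):

   a  b  c
b  2
c  5  3
d  6  5  4

(1) Merge a and b (distance 2). Clusters: {a,b}, c, d. Updated matrix:

   a,b  c
c  3
d  5    4

(2) Merge {a,b} and c (distance 3). Clusters: {a,b,c}, d. Updated matrix:

   a,b,c
d  4

(3) Merge {a,b,c} and d (distance 4). Cluster: {a,b,c,d}.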
![Page 28: BioInformatics (3)](https://reader035.fdocuments.in/reader035/viewer/2022062500/56814fb9550346895dbd71b8/html5/thumbnails/28.jpg)
Complete-Link Method
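Distance Matrix (Euclidean Distance):

   a  b  c
b  2
c  5  3
d  6  5  4

(1) Merge a and b (distance 2). Clusters: {a,b}, c, d. Updated matrix:

   a,b  c
c  5
d  6    4

(2) Merge c and d (distance 4). Clusters: {a,b}, {c,d}. Updated matrix:

     a,b
c,d  6

(3) Merge {a,b} and {c,d} (distance 6). Cluster: {a,b,c,d}.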
![Page 29: BioInformatics (3)](https://reader035.fdocuments.in/reader035/viewer/2022062500/56814fb9550346895dbd71b8/html5/thumbnails/29.jpg)
Compare Dendrograms
[Figure: two dendrograms over the leaves a, b, c, d, labelled Single-Link and Complete-Link; vertical axis: merge distance, ticks at 0, 2, 4, 6.]
![Page 30: BioInformatics (3)](https://reader035.fdocuments.in/reader035/viewer/2022062500/56814fb9550346895dbd71b8/html5/thumbnails/30.jpg)
Ordered dendrograms
2^(n-1) linear orderings of n elements (n = # genes or conditions)
Maximizing adjacent similarity is impractical. So order by:
• Average expression level,
• Time of max induction, or
• Chromosome positioning
Eisen98
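The 2^(n-1) count comes from the fact that a binary dendrogram with n leaves has n-1 internal nodes, and each node can be drawn with its two branches in either order. A sketch that enumerates the orderings for a small tree, using nested tuples as a hypothetical tree representation:

```python
def leaf_orderings(tree):
    """Yield every left-to-right leaf order reachable by flipping internal nodes.

    A tree is either a leaf label or a 2-tuple (left_subtree, right_subtree).
    """
    if not isinstance(tree, tuple):
        yield [tree]
        return
    left, right = tree
    for left_order in leaf_orderings(left):
        for right_order in leaf_orderings(right):
            yield left_order + right_order    # branches as drawn
            yield right_order + left_order    # branches flipped


# Dendrogram in which a and b join first, then c, then d.
tree = ((("a", "b"), "c"), "d")
orders = list(leaf_orderings(tree))
print(len(orders))
for order in orders:
    print("".join(order))
```

All of these orderings are consistent with the same clustering, so choosing among them needs an extra criterion such as the ones listed above.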
![Page 31: BioInformatics (3)](https://reader035.fdocuments.in/reader035/viewer/2022062500/56814fb9550346895dbd71b8/html5/thumbnails/31.jpg)
Which clustering methods do you suggest for the following two-dimensional data?
![Page 32: BioInformatics (3)](https://reader035.fdocuments.in/reader035/viewer/2022062500/56814fb9550346895dbd71b8/html5/thumbnails/32.jpg)
Nadler and Smith, Pattern Recognition Engineering, 1993
![Page 33: BioInformatics (3)](https://reader035.fdocuments.in/reader035/viewer/2022062500/56814fb9550346895dbd71b8/html5/thumbnails/33.jpg)
Problems of Hierarchical Clustering
• It is concerned more with the complete tree structure than with the optimal number of clusters.
• There is no possibility of correcting for a poor initial partition.
• Similarity and distance measures rarely have strict numerical significance.
![Page 34: BioInformatics (3)](https://reader035.fdocuments.in/reader035/viewer/2022062500/56814fb9550346895dbd71b8/html5/thumbnails/34.jpg)
Normalized Expression Data
Tavazoie et al. 1999 (http://arep.med.harvard.edu)
Non-hierarchical clustering
![Page 35: BioInformatics (3)](https://reader035.fdocuments.in/reader035/viewer/2022062500/56814fb9550346895dbd71b8/html5/thumbnails/35.jpg)
Clustering by K-means
• Given a set S of N p-dimensional vectors and no prior knowledge about the set, the K-means clustering algorithm forms K disjoint nonempty subsets such that each subset locally minimizes some measure of dissimilarity. The algorithm aims at a globally optimal dissimilarity over all subsets, although in general it is only guaranteed to converge to a local optimum.
• The K-means algorithm has time complexity O(RKN), where K is the number of desired clusters and R is the number of iterations needed to converge.
• Using the Euclidean distance between the coordinates of any two genes in the space reflects ignorance of a more biologically relevant measure of distance.
• K-means is an unsupervised, iterative algorithm that minimizes the within-cluster sum of squared distances from the cluster mean.
• The first cluster center is chosen as the centroid of the entire data set, and subsequent centers are chosen by finding the data point farthest from the centers already chosen. 200-400 iterations.
![Page 36: BioInformatics (3)](https://reader035.fdocuments.in/reader035/viewer/2022062500/56814fb9550346895dbd71b8/html5/thumbnails/36.jpg)
K-Means Clustering Algorithm
1) Select an initial partition of k clusters
2) Assign each object to the cluster with the closest center
3) Compute the new centers of the clusters:

$$C(S) = \sum_{i=1}^{n} X_i / n, \qquad X_1, \ldots, X_n \in S$$

4) Repeat steps 2 and 3 until no object changes cluster
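A compact sketch of these steps follows, using Euclidean distance and the initialization described earlier in the deck (first center at the centroid of the whole data set, later centers at the farthest data point). Reading "farthest from the centers already chosen" as "largest distance to the nearest chosen center" is an assumption, as are the function names and the toy data.

```python
import math


def centroid(points):
    """C(S): coordinate-wise mean of the points in a cluster."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def initial_centers(points, k):
    """Centroid of all data first, then repeatedly the farthest data point."""
    centers = [centroid(points)]
    while len(centers) < k:
        centers.append(max(
            points, key=lambda p: min(math.dist(p, c) for c in centers)))
    return centers


def kmeans(points, k, max_iter=400):
    centers = initial_centers(points, k)                    # step 1
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: math.dist(p, centers[j]))
                      for p in points]                      # step 2
        if new_labels == labels:                            # step 4: stable
            break
        labels = new_labels
        for j in range(k):                                  # step 3
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:                                     # keep empty centers
                centers[j] = centroid(members)
    return labels, centers


# Toy data: two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = kmeans(points, k=2)
print(labels, centers)
```

On data this cleanly separated the result does not depend on the initialization; on real expression data different starting centers can lead to different local optima.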
![Page 37: BioInformatics (3)](https://reader035.fdocuments.in/reader035/viewer/2022062500/56814fb9550346895dbd71b8/html5/thumbnails/37.jpg)
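Representation of expression data
Normalized Expression Data from microarrays: a matrix with one row per gene (Gene 1, ..., Gene N) and one column per time point (T1, T2, T3).
[Figure: Gene 1 and Gene 2 plotted as points in the space with axes Time-point 1, Time-point 2 and Time-point 3; d_ij marks the distance between two genes.]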
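A sketch of this representation: each gene is a row of time-point values, and d_ij is the distance between rows i and j. The deck does not say how the data were normalized; scaling each gene to mean 0 and variance 1 is an assumption here, and the expression values are invented for illustration.

```python
import math


def normalize(profile):
    """Shift a gene's profile to mean 0 and scale it to variance 1 (assumed)."""
    n = len(profile)
    mean = sum(profile) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in profile) / n)
    if sd == 0:
        raise ValueError("a flat profile cannot be normalized")
    return [(v - mean) / sd for v in profile]


def distance_matrix(rows):
    """d[i][j] = Euclidean distance between gene i and gene j."""
    return [[math.dist(a, b) for b in rows] for a in rows]


# Hypothetical raw expression values at time points T1, T2, T3.
expression = {
    "Gene 1": [1.0, 2.0, 3.0],
    "Gene 2": [10.0, 20.0, 30.0],   # same trend as Gene 1, larger scale
    "Gene 3": [3.0, 2.0, 1.0],      # opposite trend
}
rows = [normalize(values) for values in expression.values()]
d = distance_matrix(rows)
for name, row in zip(expression, d):
    print(name, [round(v, 3) for v in row])
```

After this normalization, genes with the same temporal pattern end up close together regardless of their absolute expression level.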
![Page 38: BioInformatics (3)](https://reader035.fdocuments.in/reader035/viewer/2022062500/56814fb9550346895dbd71b8/html5/thumbnails/38.jpg)
Identifying prevalent expression patterns (gene clusters)
[Figure: gene clusters in the space with axes Time-point 1, Time-point 2 and Time-point 3, with three panels each plotting Normalized Expression against Time-point (1, 2, 3) for one cluster.]
![Page 39: BioInformatics (3)](https://reader035.fdocuments.in/reader035/viewer/2022062500/56814fb9550346895dbd71b8/html5/thumbnails/39.jpg)
Evaluate Cluster contents
Genes (MIPS functional category):
• gpm1 (Glycolysis)
• HTB1 (Nuclear Organization)
• RPL11A, RPL12B, RPL13A, RPL14A, RPL15A, RPL17A, RPL23A (Ribosome)
• TEF2 (Translation)
• YDL228c, YDR133C, YDR134C, YDR327W, YDR417C, YKL153W, YPL142C (Unknown)