1 PROTERAN: ANIMATED TERRAIN EVOLUTION FOR VISUAL ANALYSIS OF PROTEIN FOLDING TRAJECTORY.
1
PROTERAN:
ANIMATED TERRAIN EVOLUTION FOR VISUAL ANALYSIS OF PROTEIN FOLDING TRAJECTORY
2
The need for Bioinformatics
Bioinformatics: Application of computational techniques to the management and analysis of biological information.
Applying clustering techniques to the data is not enough; a good visual representation is also needed.
3
Agenda
Microarrays
Review of existing clustering and visualization techniques on gene expression data
The need for a customized visualization tool for use by Dr. Laxmi Parida & Dr. Ruhong Zhou of the computational biology group at the IBM Watson Research Center for visual analysis of protein characteristics
Introduce our new technique that makes use of an animated terrain, implemented in the program called PROTERAN
4
Function of Genes & Proteins
Through the proteins they encode, genes orchestrate the mysteries of life
Protein functions vary widely from mechanical support to transportation to regulation.
5
Still a lot of work ahead
Traditional methods discovered gene functions on a gene-by-gene basis, so throughput was low.
Many genes are believed to work together; this does not show up when genes are studied one at a time.
6
Microarrays
Solve the throughput problem
Allow scientists to see genes on a genomic level
7
Expression Matrix
          Experiment 1   Experiment 2   ...   Experiment M
Gene 1    C5_11/C3_11    C5_12/C3_12    ...   C5_1M/C3_1M
Gene 2    C5_21/C3_21    C5_22/C3_22    ...   C5_2M/C3_2M
...       ...            ...            ...   ...
Gene N    C5_N1/C3_N1    C5_N2/C3_N2    ...   C5_NM/C3_NM
8
Clustering & Visualization Techniques Review
9
Clustering
Clustering: Act of grouping similar objects together
Applied to gene expression in order to find the function of unknown genes
Many different clustering techniques exist in the literature. Representative techniques are discussed next.
10
Determining similarity between two genes
Choose a similarity distance to compare genes
e.g. Euclidean distance
          Experiment 1   Experiment 2   ...   Experiment M
Gene 1    C5_11/C3_11    C5_12/C3_12    ...   C5_1M/C3_1M
Gene 2    C5_21/C3_21    C5_22/C3_22    ...   C5_2M/C3_2M
...       ...            ...            ...   ...
Gene N    C5_N1/C3_N1    C5_N2/C3_N2    ...   C5_NM/C3_NM
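As a small illustration of the distance choice above, the sketch below computes the Euclidean distance between two rows of an expression matrix. The function name and the sample ratios are hypothetical and are not taken from the slides.

```python
import math


def euclidean_distance(gene_a, gene_b):
    """Euclidean distance between two equal-length expression profiles."""
    if len(gene_a) != len(gene_b):
        raise ValueError("profiles must cover the same experiments")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(gene_a, gene_b)))


# Hypothetical expression ratios for two genes across M = 3 experiments.
gene_1 = [1.0, 2.0, 4.0]
gene_2 = [1.0, 5.0, 8.0]
distance = euclidean_distance(gene_1, gene_2)  # sqrt(0 + 9 + 16)
```

Other measures (for example correlation-based ones) could be swapped in by replacing this one function.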
11
Hierarchical Clustering
1. Create distance matrix of all genes in relation to each other
2. Find the two closest genes
3. Merge these two genes and redo distance matrix
4. Repeat steps 2-3 until only one cluster is left
12
Dendrogram
Binary tree with a distinguished root, which has all the data items at the leaves
Re-orders the expression matrix to place similar genes beside each other
13
Example: Agglomerative Hierarchical Clustering
        A    B    C    D
A       0    1    6    8
B            0    5    7
C                 0    2
D                      0

        (A,B)  C    D
(A,B)   0      5    7
C              0    2
D                   0

        (A,B)  (C,D)
(A,B)   0      5
(C,D)          0
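The merged distances in the example above are consistent with single linkage (the distance between two clusters is the smallest distance between any of their members), although the slides do not name the linkage rule. Under that assumption, a minimal sketch of the four clustering steps might look like this; the function names are hypothetical.

```python
from itertools import combinations


def cluster_distance(c1, c2, dist):
    # Single linkage (an assumption): distance of the closest member pair.
    return min(dist[frozenset((a, b))] for a in c1 for b in c2)


def agglomerate(labels, dist):
    """Return the merges, in order, as (cluster, cluster, distance)."""
    clusters = [(label,) for label in labels]
    merges = []
    while len(clusters) > 1:
        # Step 2: find the two closest clusters.
        c1, c2 = min(
            combinations(clusters, 2),
            key=lambda pair: cluster_distance(pair[0], pair[1], dist),
        )
        merges.append((c1, c2, cluster_distance(c1, c2, dist)))
        # Step 3: merge them; distances are recomputed on the next pass.
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges


# Distance matrix from the example (upper triangle only).
dist = {
    frozenset("AB"): 1, frozenset("AC"): 6, frozenset("AD"): 8,
    frozenset("BC"): 5, frozenset("BD"): 7, frozenset("CD"): 2,
}
merges = agglomerate("ABCD", dist)
```

Here A and B merge first at distance 1, then C and D at distance 2, and finally (A,B) with (C,D) at distance 5, matching the three matrices of the example.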
14
Advantages
Familiar to biologists
Few parameters to specify
15
Disadvantages
Requires fast CPUs and large amounts of memory
Does not identify important clusters
Only represents hierarchically organized data
Does not scale up
16
Disadvantages cont.
A dendrogram always offers 2^(n-1) representations (where n = number of elements), since each of its n-1 internal nodes can be flipped
17
Self Organizing Maps (SOMs)
User picks number of clusters, called nodes
Nodes randomly mapped to M-dimensional space (M = # of experiments)
Node values are adjusted by random vectors picked from original data
After node values settle, vectors are clustered to closest node
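The SOM steps above can be sketched roughly as follows. This is a simplified illustration only: it assumes a one-dimensional grid of nodes, a linearly decaying learning rate, and a fixed neighbor weight of 0.25, none of which come from the slides, and it is not the implementation used by any of the tools mentioned in this presentation.

```python
import random


def closest_node(nodes, vector):
    """Index of the node with the smallest squared Euclidean distance."""
    return min(
        range(len(nodes)),
        key=lambda i: sum((n - v) ** 2 for n, v in zip(nodes[i], vector)),
    )


def train_som(data, n_nodes, iterations=2000, seed=0):
    rng = random.Random(seed)
    dims = len(data[0])
    lo = [min(v[d] for v in data) for d in range(dims)]
    hi = [max(v[d] for v in data) for d in range(dims)]
    # Nodes start at random positions inside the range of the data.
    nodes = [[rng.uniform(lo[d], hi[d]) for d in range(dims)]
             for _ in range(n_nodes)]
    for step in range(iterations):
        rate = 0.5 * (1.0 - step / iterations)  # assumed decay schedule
        sample = rng.choice(data)               # random vector from the data
        winner = closest_node(nodes, sample)
        for i, node in enumerate(nodes):
            gap = abs(i - winner)
            # The winner moves most; its grid neighbors move a little.
            influence = rate if gap == 0 else rate * 0.25 if gap == 1 else 0.0
            for d in range(dims):
                node[d] += influence * (sample[d] - node[d])
    return nodes


# Hypothetical expression profiles (M = 2 experiments).
data = [[0.1, 0.2], [0.2, 0.1], [5.0, 5.1], [5.2, 4.9]]
nodes = train_som(data, n_nodes=2)
# Final step: each vector is clustered to its closest node.
assignments = [closest_node(nodes, v) for v in data]
```

The number of nodes, the iteration count, and the learning-rate schedule are exactly the kind of parameters the user has to choose for this method.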
18
Visualization
1. Dendrogram
2. Error Bar Representation
19
Visualization
3. U-Matrix
20
Advantages
User has partial control over structure
Fuzzy clusters
Variety of visual techniques applicable
21
Disadvantages
Requires knowledge of the number of clusters beforehand
Many parameters to specify
22
Principal Component Analysis (PCA)
Mathematical technique that can be used to reduce the number of dimensions of data
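To make the dimension reduction concrete, the sketch below finds the first principal component of a small data set by power iteration on the covariance matrix and projects each row onto it. Power iteration is chosen here only to keep the example dependency-free; the function names and the sample data are hypothetical.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (unit vector of the first component, mean-centered rows)."""
    n, dims = len(rows), len(rows[0])
    means = [sum(r[d] for r in rows) / n for d in range(dims)]
    centered = [[r[d] - means[d] for d in range(dims)] for r in rows]
    # Sample covariance matrix of the centered data.
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1)
            for j in range(dims)] for i in range(dims)]
    # Power iteration; assumes the start vector is not orthogonal
    # to the dominant eigenvector.
    vec = [1.0] * dims
    for _ in range(iterations):
        vec = [sum(cov[i][j] * vec[j] for j in range(dims))
               for i in range(dims)]
        norm = math.sqrt(sum(x * x for x in vec))
        vec = [x / norm for x in vec]
    return vec, centered


def project(centered, component):
    """Coordinate of each centered row along the component."""
    return [sum(x * c for x, c in zip(row, component)) for row in centered]


# Hypothetical 2-D data lying on the line y = 2x.
rows = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
component, centered = first_principal_component(rows)
scores = project(centered, component)  # the data reduced to one dimension
```

For this data the component points along the direction (1, 2), so a single coordinate per row captures all of the variation.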
23
Visualization
24
Advantages
No parameters required
3D visualization
25
Disadvantages
Little control over structure
Running time of O(N^3)
Not applicable when input is a distance matrix
26
Biclustering
Clustering of both rows and columns simultaneously
27
Available Software
Software Name | Description | Available at
F-Scan | Quantification and analysis of fluorescently probed microarrays; scatterplots; multiple image comparison | http://abs.cit.nih.gov/fscan/
TIGR SpotFinder | Spot identification | http://www.tigr.org/software/
Cluster | Hierarchical clustering, K-means clustering, Self-Organizing Map (SOM), PCA | http://rana.lbl.gov/EisenSoftware.htm
Genesis | A Java suite containing various tools such as filters, normalization, visualization tools, common clustering algorithms, SOM, k-means, PCA | http://genome.tugraz.at/Software/GenesisCenter.html
J-Express Pro 2.0 | Hierarchical clustering, K-means, Principal Component Analysis, Self-organizing maps, Profile similarity search, Normalization and filtering, Raw data import, Project organization | http://www.molmine.com/frameset/frm_jexpress.htm
TreeView | Cluster output visualization | http://rana.lbl.gov/EisenSoftware.htm
28
Protein Folding
29
Reaction Coordinates
Folding determines the function of a protein
All-atom recreation of a protein is unrealistic
Reaction coordinates are used to describe protein structure:
1. Fraction of Native Contacts
2. Radius of Gyration
3. RMSD from the native structure
4. Number of beta-strand Hydrogen Bonds
5. Number of alpha helix turns
6. Hydrophobic core radius of gyration
7. Principal Components
30
Protein States
While folding, a protein goes through certain states
The raw data is similar to microarray data.
Dr. Parida and Dr. Zhou have developed their own techniques and clustered β-Hairpin data.
31
Reaction Coordinates used on the β-Hairpin
1. Number of native β-strand hydrogen bonds
2. Radius of gyration of the hydrophobic core residues
3. Radius of gyration of entire protein
4. Fraction of native contacts
5. Principal component 1
6. Principal component 2
7. Root mean square deviation (RMSD) from the native structure
32
Raw Data
33
Patterned Cluster
RED = Number of columns in pattern. (Also defined as the Pattern Type)
WHITE = Column Number
PURPLE = Column Value
YELLOW = Number of occurrences
GREEN = Occurrences
2 0 0.1 4 0.23
3 23 26 27
34
Sample Patterned Cluster File
2 0 7.335 1 0.735
1006 59728 87235 94826-94831 95748-95752 95761-95763 … 120424-120426
2 0 7.335 1 0.736
1003 59728 87235 94826-94831 95748-95752 95761-95763 … 95769
3 0 7.335 4 -5.881 6 3.292
1036 59728 72071 87235 94826 94828-94831 … 95761-95763
3 0 7.335 4 -5.881 5 2.214
1056 59728 72071 87235 94826 94828-94831 … 95761-95763
:
5 2 8.144 3 0.899 4 -3.855 5 -33.574 6 3.292
1089 45533 59728 72071 87235 94826 … 95748-95752
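The two-line record format described above (a pattern line followed by an occurrence line) could be read with a sketch like the one below. The function name is hypothetical, and the treatment of tokens such as 94826-94831 as inclusive ranges of occurrences is an assumption based on how they appear in the sample file.

```python
def parse_patterned_cluster(pattern_line, occurrence_line):
    """Parse one two-line patterned cluster record.

    Returns (pattern, count, occurrences) where pattern maps
    column number -> column value.
    """
    fields = pattern_line.split()
    n_columns = int(fields[0])          # number of columns (pattern type)
    pairs = fields[1:]
    if len(pairs) != 2 * n_columns:
        raise ValueError("pattern line does not match its column count")
    pattern = {int(pairs[i]): float(pairs[i + 1])
               for i in range(0, len(pairs), 2)}

    tokens = occurrence_line.split()
    count = int(tokens[0])              # stated number of occurrences
    occurrences = []
    for token in tokens[1:]:
        if "-" in token:
            # Assumption: "a-b" denotes every occurrence from a to b inclusive.
            start, end = token.split("-")
            occurrences.extend(range(int(start), int(end) + 1))
        else:
            occurrences.append(int(token))
    return pattern, count, occurrences


# The small example record from the Patterned Cluster slide.
pattern, count, occurrences = parse_patterned_cluster(
    "2 0 0.1 4 0.23", "3 23 26 27")
```

Because the real file is roughly 500 MB, a full reader would likely stream the file two lines at a time rather than load it whole.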
35
The need for Visual Analysis of Patterned Cluster Data
β-Hairpin file is approximately 500 MB
Difficult to study the textual representation and get a global view
Very difficult to see interaction of all patterned clusters in relation to each other
Also very difficult to remember all patterned clusters and their occurrence in time
36
Visual Requirements
Global view
Navigation & focus
Relative growth
Details of characteristics on demand
37
Need for Customized Tool
All of the existing visualization techniques on microarrays had one or more drawbacks
None were able to provide a visual for depicting relative growth of clusters.
38
Terrain Metaphor
Has been shown to be a useful technique in searching a corpus of documents
Very recently the idea has been applied to gene expression with high density clusters representing mountains
39
Using a Landscape Metaphor to solve our requirements
Each mountain represents a patterned cluster
Mountain growth represents evolution of patterned cluster
Clicking on mountains returns details of patterned cluster
40
PROTERAN
41
Mapping of Patterned Cluster Data into Terrain Geometry
42
Mapping of Patterned Cluster Data into Terrain Geometry
Pattern Type: Number of columns in a patterned cluster
Column Combination: Unique number that identifies a combination of columns
2 0 0.1 4 0.23
3 23 26 27
43
Column Combinations
Number of column combinations = c! / ((c - t)! * t!)
c = number of characteristics
t = pattern type (number of columns in the pattern)
Pattern Type    Number of Column Combinations
2               21
3               35
4               35
5               21
6               7
7               1
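The counts in the table above follow from the formula with c = 7, matching the seven reaction coordinates used on the β-Hairpin. The short sketch below reproduces them and also enumerates the two-column combinations; the variable names are illustrative only.

```python
from itertools import combinations
from math import comb

N_CHARACTERISTICS = 7  # the seven reaction coordinates (columns 0..6)

# Number of column combinations for each pattern type t: c! / ((c - t)! * t!)
counts = {t: comb(N_CHARACTERISTICS, t) for t in range(2, 8)}

# The two-column combinations written as digit strings: "01", "02", ...
pairs = ["".join(str(col) for col in combo)
         for combo in combinations(range(N_CHARACTERISTICS), 2)]
```

The digit strings produced this way have the same form as the combination labels used in the manual layout (01, 02, ..., 56).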
44
Layout
We first thought of using an automated layout technique. However, one of Dr. Zhou’s requirements was that the same patterned cluster should appear in the same position for consistent interpretation.
Another was that larger pattern types (6 and 7 columns) must be placed very distinctly.
Hence it was decided to use a manual layout design, described next.
45
Layout
01 02 03 01234 01235 01236 012 013 014 015 016
04 05 06 01245 01246 01256 023 024 025 026 034
12 13 14 01345 01346 01356 035 036 045 046 056
15 16 23 01456 02345 02346 123 124 125 126 134
24 25 26 02356 02456 03456 135 136 145 146 156
34 35 36 12345 12346 12356 234 235 236 245 246
45 46 56 12456 13456 23456 256 345 346 356 456
0123 0124 0125 0126 0134
0135 0136 0145 0146 0156
012345 012346 012356 0234 0235 0236 0245 0246
0123456 012456 013456 023456 0256 0345 0346 0356 0456
123456 1234 1235 1236 1245 1246
1256 1345 1346 1356 1456
2345 2346 2356 2456 3456
46
Top Patterned Clusters Visualized
A final requirement from Dr. Parida and Dr. Zhou was that only the 10 largest patterned clusters of each column combination be visualized
10TH Highest Occurrence of combination 01
9TH Highest Occurrence of combination 01
2ND Highest Occurrence of combination 01
3RD Highest Occurrence of combination 01
8TH Highest Occurrence of combination 01
Highest Occurrence of combination 01
4TH Highest Occurrence of combination 01
7TH Highest Occurrence of combination 01
6TH Highest Occurrence of combination 01
5TH Highest Occurrence of combination 01
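Selecting the 10 largest patterned clusters for each column combination could be done along these lines. The tuple layout (combination label, occurrence count, cluster identifier) is an assumed in-memory representation, not the format PROTERAN itself uses.

```python
import heapq
from collections import defaultdict


def top_clusters(clusters, limit=10):
    """Keep the `limit` clusters with the most occurrences per combination.

    clusters: iterable of (column_combination, occurrence_count, cluster_id).
    Returns {column_combination: [(occurrence_count, cluster_id), ...]},
    each list sorted from highest to lowest occurrence count.
    """
    by_combination = defaultdict(list)
    for combination, count, cluster_id in clusters:
        by_combination[combination].append((count, cluster_id))
    return {combination: heapq.nlargest(limit, items)
            for combination, items in by_combination.items()}


# Hypothetical data: twelve clusters for combination 01, one for 02.
clusters = [("01", n, f"c{n}") for n in range(1, 13)] + [("02", 5, "x")]
top = top_clusters(clusters)
```

The rank of each cluster within its combination (highest, 2nd highest, and so on) is simply its index in the returned list.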
47
PROTERAN LAYOUT
48
Animated Terrain Evolution
Time proceeds from 0 to the maximum number of experiments
At each time unit, all patterned clusters are checked
If a cluster has an occurrence at that time, its mountain’s height is increased
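The growth rule above can be sketched as a simple loop over time. This illustration only computes mountain heights per frame; the function name, the unit height step, and the dictionary representation are assumptions, and no rendering is attempted.

```python
def mountain_heights(occurrences_by_cluster, max_time, step=1.0):
    """Return one {cluster_id: height} snapshot per time unit 0..max_time."""
    occurrence_sets = {cid: set(times)
                       for cid, times in occurrences_by_cluster.items()}
    heights = {cid: 0.0 for cid in occurrence_sets}
    frames = []
    for t in range(max_time + 1):
        # Check every patterned cluster at this time unit.
        for cid, times in occurrence_sets.items():
            if t in times:
                heights[cid] += step  # an occurrence raises the mountain
        frames.append(dict(heights))
    return frames


# Hypothetical clusters; "a" uses the occurrences 23, 26, 27 from the
# Patterned Cluster slide.
frames = mountain_heights({"a": [23, 26, 27], "b": [0, 1]}, max_time=27)
```

Playing the frames back in order gives the relative growth of the mountains over time, which is what the animation is meant to show.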
49
Mountains of PROTERAN
50
Results & Extensions
51
Results
Very encouraging feedback
Easy-to-use layout, and the interface allows:
1. Identification of states
2. Obtaining values of patterned clusters
3. Seeing the relation of patterned clusters to each other as they grow over time
On first use, Dr. Zhou said that he “was able to find that the hydrophobic core is largely formed before the beta-strand hydrogen bonds are formed.”
52
Future of PROTERAN
Introduced at the Intelligent Systems For Molecular Biology (ISMB) in Scotland – Received very well
Robert-Cedergren Bioinformatics Colloquium at University of Montreal (Sept 23-24th)
53
Extensions
Analyze with different types of protein data
More generic layout with more characteristics
Application with different types of data
54
Summary
1. Review of existing techniques to cluster and visualize gene expression data
2. Protein characteristics data is similar to that of gene expression data
3. None of the existing techniques applied, thus the need for a customized visual
4. Terrain Metaphor to solve our requirements implemented in the program PROTERAN
55
Questions