An algorithm to guide selection of specific biomolecules to be studied by wet-lab experiments
Jessica Wehner and Madhavi Ganapathiraju
Department of Biomedical Informatics
University of Pittsburgh School of Medicine
Pittsburgh PA USA
Presented by Thahir P. Mohamed
Advancing Practice, Instruction & Innovation through Informatics
October 19-23, 2008
2
Protein Structure
Primary Structure: Chain of amino acids
Secondary Structure: Sub-structures such as helices and strands
Tertiary Structure: Three-dimensional structure of the protein at atomic resolution
Knowledge of protein structure is essential for the successful design of drugs
3
Challenges in Protein Structure Prediction
• X-ray crystallography and NMR spectroscopy are wet-lab methods for determining structure.
• Very expensive
• Very time consuming
• Computational techniques are applied to predict protein structure
4
Computational Protein Structure Prediction
• Machine Learning techniques applied to predict structure
• Experimentally determined structures are used to learn to predict new structures
• When there is not enough data to learn from:
– Active learning is applied to select the next protein to be studied experimentally
5
Active Learning
Unlabeled Proteins
Possible Labels:
6
Active Learning
Cluster Unlabeled Proteins
Clustered Proteins
Possible Labels:
7
Active Learning
Cluster Unlabeled Proteins
Clustered Proteins
Selection Algorithm
Possible Labels:
8
Active Learning
Cluster Unlabeled Proteins
Clustered Proteins
Selection Algorithm
Possible Labels:
9
Active Learning
Cluster Unlabeled Proteins
Selection Algorithm
Labeled Proteins
Prediction
Possible Labels:
Active learning guides the selection of the data points for which labels are requested
10
Membrane Protein Structure Prediction
Membrane protein importance and challenges
Membrane proteins:
• 30% of genes
• Involved in cell regulation and signaling pathways
• 60% of drug targets
Yet:
• Difficult to study experimentally
• 1% of known protein structures
Active learning can help compensate for the limited number of known membrane protein (MP) structures, given the large number of known MP sequences
11
‘Features’ Representation
Data reduction is performed by SVD, resulting in a final 4 features per window.
Position: 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7
Residue:  A L H W R A A G A A T V L L V I V E R G A P G A Q L I
Topology: - - - - - M M M M M M M M M M M M - - - - - - - - - -
Charge:   - - p - p - - - - - - - - - - - - n p - - - - - - - -
E-Prop:   D d . . A D D . D D a d d d d d d D A . D D . D a d d
Properties: Charge, Size, Polarity, Aromaticity, Electronic Properties
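A minimal sketch of how windowed property features and the SVD reduction might look, assuming NumPy is available. The charge and aromaticity tables, the window width of 9, and the use of simple per-window counts are illustrative assumptions; the slides do not specify the actual encoding. Only the residue sequence and the target of 4 features per window come from the slide.

```python
import numpy as np

# Assumed property tables (hypothetical; the slides do not list them).
CHARGE = {"K": "p", "R": "p", "H": "p", "D": "n", "E": "n"}  # all others neutral
AROMATIC = set("FWY")

def window_counts(seq, width):
    """One row per sliding window: [positive, negative, neutral, aromatic, non-aromatic]."""
    rows = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        pos = sum(CHARGE.get(r) == "p" for r in window)
        neg = sum(CHARGE.get(r) == "n" for r in window)
        aro = sum(r in AROMATIC for r in window)
        rows.append([pos, neg, width - pos - neg, aro, width - aro])
    return np.array(rows, dtype=float)

def reduce_with_svd(matrix, k=4):
    """Project each row onto the top-k right singular vectors of the matrix."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :k] * s[:k]  # equivalent to matrix @ vt[:k].T

seq = "ALHWRAAGAATVLLVIVERGAPGAQLI"    # residue row from the slide
counts = window_counts(seq, width=9)   # width is an assumed value
features = reduce_with_svd(counts, k=4)
print(features.shape)  # (19, 4)
```

Each of the 19 windows ends up described by 4 numbers, which is the kind of input a clustering step could then consume.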
12
Clustering the Data
[Plot: data points shown along Dim 1, Dim 2 and Dim 3]
Neural Network Self Organizing Map (SOM)
• Finds centroids of clusters in the data
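To illustrate how a self-organizing map finds cluster centroids, here is a small pure-Python sketch of a 1-D SOM. The grid shape, learning-rate schedule, neighbourhood function, and all function names are assumptions made for illustration, not the configuration used in the presented work.

```python
import math
import random

def best_matching_unit(nodes, x):
    """Index of the node whose weights are closest to x (squared Euclidean distance)."""
    return min(range(len(nodes)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(nodes[j], x)))

def train_som(data, n_nodes, epochs=50, lr0=0.5, seed=0):
    """Train a 1-D SOM and return the node weight vectors (the cluster centroids)."""
    rng = random.Random(seed)
    # Initialise node weights from randomly chosen data points.
    nodes = [list(rng.choice(data)) for _ in range(n_nodes)]
    radius0 = max(n_nodes / 2.0, 1.0)
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                    # learning rate decays linearly
        radius = max(radius0 * (1.0 - frac), 0.5)  # neighbourhood shrinks over time
        for x in rng.sample(data, len(data)):      # visit samples in random order
            bmu = best_matching_unit(nodes, x)
            for j, w in enumerate(nodes):
                # Gaussian neighbourhood measured on the 1-D node grid.
                h = math.exp(-((j - bmu) ** 2) / (2.0 * radius ** 2))
                for d in range(len(w)):
                    w[d] += lr * h * (x[d] - w[d])
    return nodes

# Toy data: two well-separated groups of points.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
        (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0)]
centroids = train_som(data, n_nodes=2)
```

After training, each data point is assigned to the node returned by `best_matching_unit`, so the nodes play the role of clusters in the selection designs.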
13
Design 1: Density-based Selection
• Find the densest cluster
– Choose N points closest to its centroid
– Find labels for these points (TM, transmembrane, or NTM, non-transmembrane)
– Find the majority label, say L
– Assign L to all points in the cluster
• Repeat for the next densest cluster
Clusters with no known structures are marked for study by experiments
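The steps of Design 1 can be sketched as follows. This is an illustration under stated assumptions: cluster size stands in for density, `oracle` is a hypothetical callable representing the experimental labelling step, and points are tuples so they can be used as dictionary keys.

```python
from collections import Counter

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def density_based_labeling(clusters, centroids, oracle, n_queries):
    """clusters: {cluster_id: [point, ...]}; centroids: {cluster_id: point}.
    oracle(point) returns 'TM' or 'NTM' (assumed stand-in for the experiment)."""
    predicted = {}
    # Visit clusters from most to least populated (size used as a density proxy).
    for cid in sorted(clusters, key=lambda c: len(clusters[c]), reverse=True):
        points = clusters[cid]
        if not points:
            continue
        # Query labels only for the n_queries points closest to the centroid.
        nearest = sorted(points, key=lambda p: sq_dist(p, centroids[cid]))[:n_queries]
        votes = Counter(oracle(p) for p in nearest)
        majority = votes.most_common(1)[0][0]
        # Propagate the majority label to every point in the cluster.
        for p in points:
            predicted[p] = majority
    return predicted

clusters = {0: [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2), (1.0, 1.0)],
            1: [(5.0, 5.0), (5.1, 5.0)]}
centroids = {0: (0.0, 0.0), 1: (5.0, 5.0)}
truth = {(0.0, 0.0): "TM", (0.1, 0.0): "TM", (0.0, 0.2): "NTM",
         (1.0, 1.0): "NTM", (5.0, 5.0): "NTM", (5.1, 5.0): "NTM"}
predicted = density_based_labeling(clusters, centroids, truth.get, n_queries=3)
```

In this toy run the point `(1.0, 1.0)` receives the label TM from its cluster even though its true label is NTM, which shows the kind of error that label propagation can introduce.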
14
Design 1 Results
• Increase the number of data points for which we ask for the structure
• Compare how accuracy varies between guided selection (via active learning) and random selection.
[Plot: precision and F-score in percent (y-axis, 0 to 90) versus number of labels per node (x-axis, 1 to 40), for density-based and random selection]
A total of only 10 labels per node corresponds to about 1% of the data
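For reference, precision and F-score can be computed from true and predicted labels as in the sketch below. Treating TM as the positive class and using the balanced F1 form of the F-score are assumptions; the slides do not state which variants were used.

```python
def precision_recall_fscore(true_labels, predicted_labels, positive="TM"):
    """Return (precision, recall, F1) for the given positive class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

true_labels = ["TM", "TM", "NTM", "NTM", "TM"]
predicted_labels = ["TM", "NTM", "TM", "NTM", "TM"]
precision, recall, fscore = precision_recall_fscore(true_labels, predicted_labels)
```

Here two of the three predicted TM labels are correct and two of the three true TM labels are recovered, so precision, recall and F-score all equal 2/3.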
15
Design 2: Protein-based Selection
• Pick a random protein
• Find labels for all windows in this protein
• For each node containing labels, find the mode L of all labels it contains
• Assign L to the remaining data in the node
• Repeat and update for each new protein, until half of the proteins have been selected
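The Design 2 procedure can be sketched as below. The data layout (`windows` mapping a window id to its protein and SOM node) and the `oracle` callable are hypothetical, and for brevity the sketch reports only the final state after half of the proteins have been selected rather than the update after each protein.

```python
import random
from collections import Counter, defaultdict

def protein_based_labeling(windows, oracle, seed=0):
    """windows: {window_id: (protein_id, node_id)}; oracle(window_id) returns a label.
    Returns (predicted labels for all windows, experimentally labelled windows)."""
    rng = random.Random(seed)
    proteins = sorted({prot for prot, _ in windows.values()})
    rng.shuffle(proteins)                  # random protein selection order
    known = {}                             # window_id -> label obtained from the oracle
    node_votes = defaultdict(Counter)      # node_id -> counts of known labels
    for protein in proteins[:len(proteins) // 2]:  # stop once half are selected
        for wid, (prot, node) in windows.items():
            if prot == protein:
                known[wid] = oracle(wid)
                node_votes[node][known[wid]] += 1
    predicted = {}
    for wid, (prot, node) in windows.items():
        if wid in known:
            predicted[wid] = known[wid]
        elif node_votes[node]:
            # Assign the mode of the known labels in this node.
            predicted[wid] = node_votes[node].most_common(1)[0][0]
        else:
            predicted[wid] = None          # node has no labelled windows yet
    return predicted, known

windows = {"a1": ("A", 0), "a2": ("A", 1), "b1": ("B", 0), "b2": ("B", 1),
           "c1": ("C", 0), "c2": ("C", 1), "d1": ("D", 0), "d2": ("D", 1)}
truth = {wid: ("TM" if node == 0 else "NTM") for wid, (_, node) in windows.items()}
predicted, known = protein_based_labeling(windows, truth.get)
```

Changing `seed` changes the order in which proteins are selected, which is one way to repeat the experiment over different permutations.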
16
Protein-based results
The procedure was repeated for different permutations of the protein selection order, and several metrics were observed.
[Plot: observed metrics, y-axis in percent]
17
Conclusions
• We developed a framework that allows us to select a few proteins or fragments of proteins which, when annotated with experimental methods, may be used to label remaining protein sequences.
• We have shown that it is possible to achieve higher accuracy values with guided selection of data compared to random selection of data.
Acknowledgements
Madhavi Ganapathiraju
Jessica Wehner
JW funded through NIH-NSF Bioengineering & Bioinformatics Summer Institute
Thank you!
Visit us at
Department of Biomedical Informatics, University of Pittsburgh
www.dbmi.pitt.edu/madhavi
[Photo: Cathedral of Learning, University of Pittsburgh]