Capstone Project Presentation
Friday 17th December 2004, Stuart Young
Predicting Deleterious Mutations
Young SP, Radivojac P, Mooney SD
Deleterious
  “Hurtful or injurious to life or health; noxious” (Oxford English Dictionary)
  “’Tis pity wine should be so deleterious, For tea and coffee leave us much more serious.” (Byron, Don Juan IV, 1821)
SNPs
  What is an SNP (single nucleotide polymorphism)?
  Why are SNPs important?
  Some SNPs are nonsynonymous
  The molecular effects of SNPs vary widely
Motivation
  Improve on the existing methods for predicting deleterious mutations
  Use protein sequence, evolution and structure data combined with machine learning to identify potentially disease-causing SNPs
SNP data is increasingly available
  Over 40 major online databases
  dbSNP is the primary SNP database (contains 5,000,000+ validated human SNPs)
  Many databases contain potentially disease-causing SNPs related to a particular disease
Deleterious effects of mutations on proteins
  Function
  Stability
  Expression
  Protein-Protein Interactions
Current Classification Tools
  Sequence Approaches
    BLOSUM62
      An amino acid substitution score matrix
    SIFT
      Collects sequence homologues in multiple alignments and identifies non-conservative changes in amino acids
      Ng P & Henikoff S, 'Predicting Deleterious Amino Acid Substitutions'. Genome Research, 2001, 11:863-874.
Current Classification Tools
  Structural Approaches
    Expert rules
      Uses evolutionary and structural data
      Sunyaev et al, 'Prediction of deleterious human alleles'. Human Molecular Genetics, 2001, Vol. 10, No. 6, 593.
    Decision Trees
      Improved performance based on sequence and structural data
      Produces intuitive rules
Our foundation for the project
  Saunders CT & Baker D, 'Evaluation of Structural and Evolutionary Contributions to Deleterious Mutation Prediction'. J. Mol. Biol. (2002) 322, 891–901
  Structural and evolutionary features
  Trained classifiers based on two data sets: experimental mutations and human alleles
S & B - Training Sets
  Experimental mutations (~5,000)
    HIV-1 protease
    E. coli Lac repressor
    T4 lysozyme
  Human alleles (~350 mutations)
    103 'hot' human genes
Why two training sets?
  Unbiased human data is hard to get:
    Many disease-associated mutations are discovered through genetic association studies and may not be causative (i.e., only linked with the causative allele)
    Effect of mutations is hard to measure
  Experimental 'whole gene mutagenesis' data is considered 'unbiased'
Features used in S&B Study
  SIFT
  SIFT + Solvent Accessibility (SA)
  SIFT + normalized B-factor
  SIFT + Sunyaev expert rules
  SIFT + SA + B-factor
Hypothesis
  Can we improve on the results of Saunders and Baker by using more structural and sequence properties?
Experimental Design
  Classification algorithm
    Decision Trees
    Support Vector Machines
    Neural Nets
  Additional Features
    Amino acid relative frequencies
    Additional structural properties
Structural Property Values
  Russ Altman (Stanford) developed a vector representation of protein structural sites
  Spheres (1.875Å → 7.5Å) centered on C-alpha atom of the mutation position
  66 features
  Atom/residue counts within sphere and other features, e.g.:
    Solubility
    Solvent accessibility
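The idea of counting atoms in spheres around the mutation site can be sketched as follows. This is only an illustration, not the S-BLEST implementation: the slides give just the 1.875Å and 7.5Å endpoints, so the four equal-width shells, the function name `shell_counts`, and the plain coordinate tuples are all assumptions.

```python
import math

# Assumed shell boundaries in Å: four equal-width shells out to 7.5 Å.
# Only the 1.875 Å and 7.5 Å endpoints come from the slides.
SHELL_EDGES = [1.875, 3.75, 5.625, 7.5]


def shell_counts(center, atoms, edges=SHELL_EDGES):
    """Count atoms falling in each concentric shell around `center`.

    center: (x, y, z) of the C-alpha atom at the mutation position.
    atoms:  iterable of (x, y, z) coordinates of surrounding atoms.
    Returns one count per shell; atoms beyond the last edge are ignored.
    """
    counts = [0] * len(edges)
    for atom in atoms:
        distance = math.dist(center, atom)
        for i, outer in enumerate(edges):
            if distance <= outer:
                counts[i] += 1
                break  # each atom belongs to exactly one shell
    return counts
```

A real feature vector would repeat this per atom or residue type and add the other site properties; the sketch only shows the shell bookkeeping.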
Amino Acid Windows
  AA frequencies within a window on either side of the mutation position
  20 AAs = 20 features
  LEFT and RIGHT → 40 features
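A minimal sketch of the window features described above, assuming a 0-based mutation position and a fixed window width on each side. The function name `window_frequencies`, the `width` parameter, and the truncation of windows at the sequence ends are illustrative choices, not details taken from the slides.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def window_frequencies(sequence, position, width):
    """Return 40 features: 20 AA frequencies LEFT, then 20 RIGHT.

    position: 0-based index of the mutated residue (itself excluded).
    width:    number of residues taken on each side (assumed parameter).
    Windows are truncated at the sequence ends; an empty window yields zeros.
    """
    left = sequence[max(0, position - width):position]
    right = sequence[position + 1:position + 1 + width]
    features = []
    for window in (left, right):
        n = len(window)
        for aa in AMINO_ACIDS:
            features.append(window.count(aa) / n if n else 0.0)
    return features
```

Calling the function with several `width` values and concatenating the results would give windows of varying widths.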
[Figure: Amino Acid Windows]
Tools
  Databases
    PDB - Protein structure data
    S-BLEST - Structural features
  Software
    Perl 5.8.0
    Matlab (NN, PRTools (DT), SVC)
List of Features Used
  BLOSUM62, disorder, secondary structure, molecular weight
  Grouped amino acid frequency windows of varying widths
  SIFT
  S-BLEST (vector contains four sub-shells spreading outward from site)
  Solvent accessibility (C-beta density, i.e., the number of C-beta atoms around the site)
Comparison with S&B Results
1. Human Data Set
  Human allele dataset as train and test set
  Ensembles of decision trees for classification
  20-fold cross validation
  Progressively added features to see their effect on performance
  Because structural data was not available for all mutation sites, we used a subset of the original Saunders and Baker training set
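The 20-fold cross validation step can be sketched in plain Python as below. The project itself used Matlab tools, so this is a stand-in. The function names, the fixed shuffle seed, and the majority-class baseline that takes the place of the decision-tree ensemble are all assumptions made for illustration.

```python
import random
from collections import Counter


def k_fold_indices(n_samples, k=20, seed=0):
    """Shuffle sample indices once and deal them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validated_accuracy(features, labels, train_fn, k=20, seed=0):
    """Fraction of samples predicted correctly when each is held out once.

    train_fn(train_X, train_y) must return a predict(x) callable; it stands
    in for whichever classifier is being evaluated.
    """
    correct = 0
    for held_out in k_fold_indices(len(labels), k, seed):
        held = set(held_out)
        train_idx = [i for i in range(len(labels)) if i not in held]
        predict = train_fn([features[i] for i in train_idx],
                           [labels[i] for i in train_idx])
        correct += sum(predict(features[i]) == labels[i] for i in held_out)
    return correct / len(labels)


def majority_class_trainer(train_X, train_y):
    """Trivial baseline: always predict the most common training label."""
    most_common = Counter(train_y).most_common(1)[0][0]
    return lambda x: most_common
```

Passing a different `train_fn` for each feature set would reproduce the "progressively added features" comparison under the same folds.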
Best Features
2. Experimental Data Set
  Same as human data set but using experimental mutations for training and testing
Evaluation of S-BLEST Using a Random Subset of the Experimental Training Set
3. Cross-classification
  Used the same features described above
  Trained on one dataset and tested on the other:
    Human to experimental
    Experimental to human
    Experimental gene to experimental gene
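Cross-classification amounts to fitting on one data set and scoring on a different one. The sketch below uses a single-score threshold classifier (a decision stump) purely as a placeholder for the classifiers used in the project. The function names and the convention that low scores mean deleterious are assumptions.

```python
def train_stump(scores, labels):
    """Pick the threshold on one score that best separates the labels.

    Predicts deleterious (1) when score <= threshold, an assumed convention.
    """
    best_threshold, best_correct = None, -1
    for threshold in sorted(set(scores)):
        correct = sum((s <= threshold) == bool(y)
                      for s, y in zip(scores, labels))
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return lambda s: int(s <= best_threshold)


def accuracy(predict, scores, labels):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(predict(s) == y for s, y in zip(scores, labels)) / len(labels)


def cross_classify(train_set, test_set):
    """Train on one (scores, labels) data set and evaluate on another."""
    predict = train_stump(*train_set)
    return accuracy(predict, *test_set)
```

Calling `cross_classify(human, experimental)` and then `cross_classify(experimental, human)` gives the two directions listed above. The two numbers need not agree, since a threshold tuned on one data set may sit badly on the other.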
Summary of Results
  Human data set: 80% accuracy (up from 70%)
  Experimental data set: 87% accuracy (up from 79.5%)
Conclusion
  Prediction tools CAN identify deleterious mutations
  Further study is warranted to identify over-fitted classifiers and so improve classification accuracy on real-world data
Acknowledgements
  People
    Andrew Campen (CCBB IT, IUPUI)
    Brandon Peters (CCBB, IUPUI)
    Haixu Tang (Capstone Coordinator, IUB)
  Funding
    This work was funded by a grant from the Showalter Trust (Sean Mooney, PI), INGEN, and an IUPUI McNair Scholarship. The Indiana Genomics Initiative (INGEN) of Indiana University is supported in part by Lilly Endowment Inc.
Thank You