Proteomic Characterization of Alternative Splicing and Coding Polymorphism
description
Transcript of Proteomic Characterization of Alternative Splicing and Coding Polymorphism
Proteomic Characterization of Alternative Splicing and Coding
Polymorphism
Nathan EdwardsCenter for Bioinformatics and Computational
Biology
University of Maryland, College Park
Why don’t we see more novel peptides?
Tandem mass spectrometry doesn’t discriminate against novel peptides...
...but protein sequence databases do!
Searching traditional protein sequence databases biases the results towards well-understood protein isoforms!
What goes missing?
Known coding SNPs
Novel coding mutations
Alternative splicing isoforms
Alternative translation start-sites
Microexons
Alternative translation frames
Why should we care?
Alternative splicing is the norm!• Only 20-25K human genes• Each gene makes many proteins
Proteins have clinical implications• Biomarker discovery
Evidence for SNPs and alternative splicing stops with transcription• Genomic assays, ESTs, mRNA sequence.• Little hard evidence for translation start site
Novel Splice Isoform
Human Jurkat leukemia cell-line• Lipid-raft extraction protocol, targeting T cells• von Haller, et al. MCP 2003.
LIME1 gene:• LCK interacting transmembrane adaptor 1
LCK gene:• Leukocyte-specific protein tyrosine kinase• Proto-oncogene• Chromosomal aberration involving LCK in leukemias.
Multiple significant peptide identifications
Novel Splice Isoform
Novel Splice Isoform
Novel Mutation
HUPO Plasma Proteome Project• Pooled samples from 10 male & 10 female
healthy Chinese subjects• Plasma/EDTA sample protocol• Li, et al. Proteomics 2005. (Lab 29)
TTR gene• Transthyretin (pre-albumin) • Defects in TTR are a cause of amyloidosis.• Familial amyloidotic polyneuropathy
• late-onset, dominant inheritance
Novel Mutation
Ala2→Pro associated with familial amyloid polyneuropathy
Novel Mutation
Searching Expressed Sequence Tags (ESTs)
Pros
No introns!
Primary splicing evidence for annotation pipelines
Evidence for dbSNP
Often derived from clinical cancer samples
Cons
No frame
Large (8Gb)
“Untrusted” by annotation pipelines
Highly redundant
Nucleotide error rate ~ 1%
Compressed EST Peptide Sequence Database
For all ESTs mapped to a UniGene gene:• Six-frame translation• Eliminate ORFs < 30 amino-acids• Eliminate amino-acid 30-mers observed once• Compress to C2 FASTA database
• Complete, Correct for amino-acid 30-mersGene-centric peptide sequence database:
• Size: 223 Mb vs 8 Gb, 20774 FASTA entries• Running time: 15 mins vs 22 hours• E-values: 50-fold reduction
Download:• http://www.umiacs.umd.edu/~nedwards
Back to the lab...
Current LC/MS/MS workflows identify a few peptides per protein• ...not sufficient for protein isoforms
Need to raise the sequence coverage to (say) 80%• ...protein separation prior to LC/MS/MS
analysis
Future informatics directions...
Combine results from multiple searches from multiple engines
Fast, automated triage of “significant false-positive” peptide identifications
Compressed EST peptide sequence database for other species• Mouse, Rat, Zebrafish, Chicken, Cow, A. thaliana, ??
Relational database and web-application infrastructure• Interactive browser data-grid, flexible web-services export• Java Applet MS/MS viewers, GFF for Genome Browser
Conclusions
Peptides identify more than just proteins• Untapped source of disease biomarkers• Functional vs silencing variants
Compressed peptide sequence databases make routine EST searching feasible
Statistically significant peptide identification is only the first step
Acknowledgements
Catherine Fenselau, Steve Swatkoski• UMCP Biochemistry
Chau-Wen Tseng, Xue Wu• UMCP Computer Science
Cheng Lee• Calibrant Biosystems
PeptideAtlas, HUPO PPP, X!Tandem
Funding: NCI