Proteomic Characterization of Alternative Splicing and Coding Polymorphism

Nathan Edwards

Center for Bioinformatics and Computational Biology

University of Maryland, College Park

Why don’t we see more novel peptides?

Tandem mass spectrometry doesn’t discriminate against novel peptides...

...but protein sequence databases do!

Searching traditional protein sequence databases biases the results towards well-understood protein isoforms!

What goes missing?

Known coding SNPs

Novel coding mutations

Alternative splicing isoforms

Alternative translation start-sites

Microexons

Alternative translation frames

Why should we care?

Alternative splicing is the norm!
• Only 20-25K human genes
• Each gene makes many proteins

Proteins have clinical implications
• Biomarker discovery

Evidence for SNPs and alternative splicing stops with transcription
• Genomic assays, ESTs, mRNA sequence.
• Little hard evidence for translation start site

Novel Splice Isoform

Human Jurkat leukemia cell-line
• Lipid-raft extraction protocol, targeting T cells
• von Haller, et al. MCP 2003.

LIME1 gene:
• LCK interacting transmembrane adaptor 1

LCK gene:
• Leukocyte-specific protein tyrosine kinase
• Proto-oncogene
• Chromosomal aberration involving LCK in leukemias.

Multiple significant peptide identifications


http://codon.umiacs.umd.edu:8891/thegpm-cgi/peptide.pl?path=/tandem/archive/GPM00300000340.3.xml&uid=53361&label=AAAACKOM&homolog=AAAACKOM&id=895.1.1&proex=-1

Novel Mutation

HUPO Plasma Proteome Project
• Pooled samples from 10 male & 10 female healthy Chinese subjects
• Plasma/EDTA sample protocol
• Li, et al. Proteomics 2005. (Lab 29)

TTR gene
• Transthyretin (pre-albumin)
• Defects in TTR are a cause of amyloidosis.
• Familial amyloidotic polyneuropathy
  • late-onset, dominant inheritance

Novel Mutation

Ala2→Pro associated with familial amyloid polyneuropathy


Searching Expressed Sequence Tags (ESTs)

Pros

No introns!

Primary splicing evidence for annotation pipelines

Evidence for dbSNP

Often derived from clinical cancer samples

Cons

No frame

Large (8Gb)

“Untrusted” by annotation pipelines

Highly redundant

Nucleotide error rate ~ 1%

Compressed EST Peptide Sequence Database

For all ESTs mapped to a UniGene gene:
• Six-frame translation
• Eliminate ORFs < 30 amino-acids
• Eliminate amino-acid 30-mers observed once
• Compress to C2 FASTA database
  • Complete, Correct for amino-acid 30-mers

Gene-centric peptide sequence database:
• Size: 223 Mb vs 8 Gb, 20774 FASTA entries
• Running time: 15 mins vs 22 hours
• E-values: 50-fold reduction

Download:
• http://www.umiacs.umd.edu/~nedwards
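The first three steps of the construction above (six-frame translation, the 30 amino-acid ORF cutoff, and dropping 30-mers observed only once) can be sketched in a few lines of Python. This is an illustrative sketch only, not the code used to build the database: the function names are hypothetical, an "ORF" is assumed to mean a stop-to-stop stretch, the standard genetic code is assumed, and "observed once" is assumed to mean "seen in only one EST". The final C2 compression step is not shown.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption: nuclear code).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate in frame 0; codons with ambiguous bases become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_orfs(dna, min_len=30):
    """Yield stop-to-stop ORFs of at least min_len residues from all six frames."""
    dna = dna.upper()
    for strand in (dna, reverse_complement(dna)):
        for offset in range(3):
            for orf in translate(strand[offset:]).split("*"):
                if len(orf) >= min_len:
                    yield orf


def repeated_kmers(est_sequences, k=30, min_count=2):
    """Return the amino-acid k-mers seen in at least min_count different ESTs."""
    counts = Counter()
    for est in est_sequences:
        seen = set()  # count each k-mer at most once per EST
        for orf in six_frame_orfs(est, min_len=k):
            seen.update(orf[i:i + k] for i in range(len(orf) - k + 1))
        counts.update(seen)
    return {kmer for kmer, n in counts.items() if n >= min_count}


if __name__ == "__main__":
    ests = ["GCT" * 35, "GCT" * 35, "GAA" * 35]  # toy ESTs, not real data
    kept = repeated_kmers(ests)
    print("A" * 30 in kept, "E" * 30 in kept)
```

Requiring a 30-mer to be supported by more than one EST is what lets the database tolerate the ~1% nucleotide error rate noted earlier: a sequencing error is unlikely to recur at the same position in independent ESTs.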

Back to the lab...

Current LC/MS/MS workflows identify a few peptides per protein
• ...not sufficient for protein isoforms

Need to raise the sequence coverage to (say) 80%
• ...protein separation prior to LC/MS/MS analysis
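For reference, sequence coverage here means the fraction of a protein's residues that fall inside at least one identified peptide. A minimal sketch of that calculation follows; the function name and the toy sequence are made up for illustration, and peptides are assumed to match the protein sequence exactly.

```python
def sequence_coverage(protein, peptides):
    """Fraction of protein residues covered by at least one peptide (exact matches)."""
    if not protein:
        return 0.0
    covered = [False] * len(protein)
    for pep in peptides:
        if not pep:
            continue  # skip empty strings, which would match everywhere
        start = protein.find(pep)
        while start != -1:
            # Mark every residue of this occurrence, then look for the next one.
            for i in range(start, start + len(pep)):
                covered[i] = True
            start = protein.find(pep, start + 1)
    return sum(covered) / len(protein)


if __name__ == "__main__":
    toy_protein = "MKTAYIAKQRQISFVKSHFSRQ"   # hypothetical 22-residue sequence
    toy_peptides = ["TAYIAK", "SHFSR"]       # hypothetical identifications
    print(f"{sequence_coverage(toy_protein, toy_peptides):.0%}")
```

With only a few peptides per protein, most residues stay uncovered, so a variant or splice junction elsewhere in the sequence cannot be confirmed or ruled out.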

Future informatics directions...

Combine results from multiple searches from multiple engines

Fast, automated triage of “significant false-positive” peptide identifications

Compressed EST peptide sequence database for other species
• Mouse, Rat, Zebrafish, Chicken, Cow, A. thaliana, ??

Relational database and web-application infrastructure
• Interactive browser data-grid, flexible web-services export
• Java Applet MS/MS viewers, GFF for Genome Browser

Conclusions

Peptides identify more than just proteins
• Untapped source of disease biomarkers
• Functional vs silencing variants

Compressed peptide sequence databases make routine EST searching feasible

Statistically significant peptide identification is only the first step

Acknowledgements

Catherine Fenselau, Steve Swatkoski
• UMCP Biochemistry

Chau-Wen Tseng, Xue Wu
• UMCP Computer Science

Cheng Lee
• Calibrant Biosystems

PeptideAtlas, HUPO PPP, X!Tandem

Funding: NCI
