Shotgun sequencing
[Figure: an MS scan followed by MS2 scans labeled 1st through 10th. Labels: relative intensity, fill times, scan times.]

Spectral matching: MS/MS spectrum vs. protein database

Shotgun sequencing
[Figure: MS1 and MS2 scans along the time axis.]

Distributed spectral matching
LTQ Orbitrap base peak chromatogram, 37 min LC-MS/MS run time:
  6186 MS/MS spectra
  2308 peptide IDs (false-positive rate 1%)
  287 protein IDs
Search time: ~6000 spectra x 10 s/spectrum = ~16 CPU hours
  Server, single CPU: 16 hours
  Server, 20 nodes (parallel CPUs): 0.8 hours
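
The search-time arithmetic on this slide can be checked with a small sketch. The function name and the assumption of ideal (perfectly even) scaling across nodes are illustrative and not taken from the slides.

```python
def search_hours(n_spectra, seconds_per_spectrum, n_nodes=1):
    """Wall-clock hours for a search split evenly across n_nodes.

    Assumes ideal scaling: no scheduling or I/O overhead.
    """
    cpu_seconds = n_spectra * seconds_per_spectrum
    return cpu_seconds / 3600 / n_nodes


single = search_hours(6000, 10)        # one CPU
cluster = search_hours(6000, 10, 20)   # 20 parallel nodes
print(f"{single:.1f} h on one CPU, {cluster:.2f} h on 20 nodes")
```

Under these assumptions 6000 spectra at 10 s each come to about 16.7 CPU hours, or about 0.83 hours on 20 nodes, in line with the rounded 16 and 0.8 hours on the slide.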

SEQUEST scores
XCorr: goodness of fit between the experimental MS/MS spectrum and the theoretical b and y ions of peptides in the database.
dCn: fractional XCorr difference between the highest XCorr and the next-highest XCorr.
Yates J.R. 3rd et al., J Am Soc Mass Spectrom 5:976-89 (1994)
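
As a minimal sketch of the dCn definition above, the fractional difference can be computed from the ranked XCorr values of one spectrum. The function name and the handling of a missing runner-up are assumptions made for illustration.

```python
def delta_cn(xcorrs):
    """Fractional difference between the best and second-best XCorr."""
    ranked = sorted(xcorrs, reverse=True)
    if len(ranked) < 2 or ranked[0] <= 0:
        # Assumption: with no runner-up (or no positive top score) report 0.
        return 0.0
    return (ranked[0] - ranked[1]) / ranked[0]


# Hypothetical XCorr values for the candidate peptides of one spectrum.
print(delta_cn([3.0, 4.0, 1.0]))  # (4.0 - 3.0) / 4.0 = 0.25
```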

SEQUEST input: all MS2 spectra in an LC run (5000 - 25000 MS2 spectra)
[Figure: MS1 and MS2 scans along the time axis.]

SEQUEST input files: all MS2 spectra in the LC run
raw: all MS2 = 1 file
dta: 1 MS2 = 1 file (all MS2 = ~10000 files)
  dta 1: 501.000 (precursor m/z), +2 (charge state), MS2 array
  dta 2: 1001.500 (precursor m/z), +3 (charge state), MS2 array

SEQUEST search loop
For each dta from the LC run (1, 2, 3, ..., 10000), e.g. precursor 1000.000 +/- 1 Da or 3000.000 +/- 1 Da, against the human IPI database (61236 proteins):
  1. Digest to the next peptide, e.g. MSQVQVQVQNPSAALSGSQILNK.
  2. Calculate the peptide mass: 2426.258812.
  3. Compare with the precursor: not a candidate.
  4. If a candidate, calculate the theoretical spectrum; correlate, score & return.
Repeated 3,250,000 times per dta: 2 x 3,250,000 times for 2 dta, 3 x 3,250,000 times for 3 dta, 10000 x 3,250,000 times for the whole run.
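
The candidate-selection step described on this slide can be sketched as follows. This is an illustration only, not SEQUEST's implementation: the function names are hypothetical, the digest assumes trypsin with no missed cleavages and no cleavage before proline, and masses are neutral monoisotopic values built from a standard residue-mass table.

```python
# Monoisotopic residue masses in Da (standard values, rounded to 5 decimals).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def peptide_mass(seq):
    """Neutral monoisotopic mass: residue masses plus one water."""
    return sum(MONO[aa] for aa in seq) + WATER


def tryptic_digest(protein):
    """Cleave after K or R unless the next residue is P (assumed rule)."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        nxt = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and nxt != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


def candidates(precursor_mass, proteins, tol=1.0):
    """Return peptides whose mass lies within tol Da of the precursor mass."""
    hits = []
    for protein in proteins:
        for pep in tryptic_digest(protein):
            if abs(peptide_mass(pep) - precursor_mass) <= tol:
                hits.append(pep)
    return hits


# Sequence fragment shown for ipi00000001.2 on the database slide.
protein = ("MSQVQVQVQNPSAALSGSQILNKNQSLLSQPLMSIPSTTSSLPSENAGRPIQ"
           "NSALPSASITSTSAAAESITPTVELNAL")
print(peptide_mass("MSQVQVQVQNPSAALSGSQILNK"))  # about 2426.26
print(candidates(1000.0, [protein]))  # no peptide within 1 Da of 1000
print(candidates(2426.0, [protein]))  # the first tryptic peptide
```

Only peptides that pass the mass comparison would go on to the more expensive theoretical-spectrum and correlation steps, which is why the cheap mass filter comes first.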

Similarity scoring: XCorr score
[Figure: theoretical candidate spectrum, experimental peptide spectrum, and the resulting correlation spectrum.]
Yates J.R. 3rd et al., J Am Soc Mass Spectrom 5:976-89 (1994)

Similarity scoring: cross-correlation vs. dot product
[Figure: panels labeled XCorr (cross-correlation) and dot product.]
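
The difference between the two scores can be illustrated with a simplified sketch on pre-binned spectra. It is not the SEQUEST implementation: spectrum preprocessing is omitted, the function names are hypothetical, and the background window of +/- 75 bins (excluding zero offset) is an assumption chosen for illustration.

```python
def dot_product(exp, theo):
    """Plain dot product of two equally binned spectra."""
    return sum(e * t for e, t in zip(exp, theo))


def shifted_dot(exp, theo, offset):
    """Dot product with the experimental spectrum displaced by `offset` bins."""
    n = len(exp)
    total = 0.0
    for i in range(n):
        j = i + offset
        if 0 <= j < n:
            total += exp[j] * theo[i]
    return total


def xcorr(exp, theo, max_offset=75):
    """Zero-offset correlation minus the mean correlation at nonzero offsets."""
    offsets = [o for o in range(-max_offset, max_offset + 1) if o != 0]
    background = sum(shifted_dot(exp, theo, o) for o in offsets) / len(offsets)
    return dot_product(exp, theo) - background


# A flat, noise-like pair of spectra: large dot product, much smaller XCorr.
flat = [1.0] * 200
print(dot_product(flat, flat), xcorr(flat, flat))

# Two sharp matching peaks: XCorr stays close to the dot product.
peaks = [0.0] * 200
peaks[50] = peaks[120] = 1.0
print(dot_product(peaks, peaks), xcorr(peaks, peaks))
```

In this toy case the flat spectra score 200 by dot product but only 38 by cross-correlation, because the match is just as good at every offset. The sharp peaks keep almost all of their score, which is the property that makes the background-subtracted score more selective.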

Non-indexed searching
Human IPI database, 61236 proteins; precursor 1200 +/- 1 Da
1st: >ipi00000001.2
MSQVQVQVQNPSAALSGSQILNKNQSLLSQ
PLMSIPSTTSSLPSENAGRPIQNSALPSASITST
SAAAESITPTVELNAL.
61236th: >ipi00853644.1
.AKPNINLITGHLEEPMPNPIDEMTEEQKEY
EAMKLVNMLDKLSREELLKPMGLKPDGTIT

Indexed searching
Human IPI database, 61236 proteins, indexed; precursor 1200 +/- 1 Da
>ipi00001234.11
G (75 Da)
>ipi00853644.1
AKPNINLITGHLEEPMPNPIDEMTEEQEYEA
MLVNMLDLSEELLKPMGLKPDGTITAKPNINL
ITGHLEEPMPNPIDEMTEEQEYEAMLVNML
DLSEELLKPMGLKPDGTIT (20245 Da)
>ipi00344567.1
WEFGGHTVLR
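
One way to picture indexed searching is a peptide list sorted by mass once, then queried with a binary search for each precursor window. The sketch below is an assumption about how such an index could look, not a description of SEQUEST's index format. The masses 75.03 and 1200.60 are approximate monoisotopic values computed for G and WEFGGHTVLR; the long peptide is represented by a placeholder string.

```python
from bisect import bisect_left, bisect_right


def build_index(entries):
    """Sort (mass, protein_id, peptide) tuples by mass once, up front."""
    index = sorted(entries)
    masses = [mass for mass, _, _ in index]
    return index, masses


def lookup(index, masses, target_mass, tol=1.0):
    """Return all entries with mass in [target - tol, target + tol]."""
    lo = bisect_left(masses, target_mass - tol)
    hi = bisect_right(masses, target_mass + tol)
    return index[lo:hi]


entries = [
    (20245.0, "ipi00853644.1", "<long undigested peptide>"),  # placeholder text
    (75.03, "ipi00001234.11", "G"),
    (1200.60, "ipi00344567.1", "WEFGGHTVLR"),
]
index, masses = build_index(entries)
print(lookup(index, masses, 1200.0))
```

A non-indexed search has to digest and weigh every protein for every spectrum, while the sorted index pays that cost once and answers each mass window in logarithmic time.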

Scoring & analysis
[Figure: frequency vs. score/criterion, with a cutoff/threshold separating TP, TN, FN, and FP.]

            Score/Metric 1   Score/Metric 2   Score/Metric 3
Peptide A   7.65             0.99             97
Peptide B   6.99             0.87             97
Peptide C   6.21             0.65             97
Peptide D   5.57             0.71             96
Peptide E   3.31             0.44             50
Peptide F   1.85             0.41             41

sensitivity = TP / (TP + FN)
precision = TP / (TP + FP)
specificity = TN / (TN + FP)
accuracy = (TP + TN) / (TP + TN + FN + FP)
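
The four definitions above translate directly into code. The function name and the example counts are hypothetical and chosen only to show the arithmetic.

```python
def classification_metrics(tp, tn, fp, fn):
    """Sensitivity, precision, specificity, and accuracy from the four counts."""
    return {
        "sensitivity": tp / (tp + fn),
        "precision": tp / (tp + fp),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + tn + fn + fp),
    }


# Hypothetical counts at one score cutoff.
for name, value in classification_metrics(tp=90, tn=80, fp=10, fn=20).items():
    print(f"{name}: {value:.3f}")
```

Moving the cutoff trades these quantities against each other: a stricter cutoff removes false positives (raising precision) at the cost of more false negatives (lowering sensitivity).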

The Results: Distinguishing Right from Wrong
In large proteomics data sets (for which manual data inspection is impossible), how can we distinguish between correct and incorrect peptide assignments?
Use decoy sequences to distract non-peptidic, non-uniquely matchable, or otherwise unmatchable spectra into a search space that is known a priori to be incorrect.
Use the frequency of decoy sequences among total sequences to estimate the overall frequency of wrong answers (False Positive Rate).
Adjust filtering criteria to achieve a ~1% False Positive Rate.
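
The last step, adjusting a filter until the estimated False Positive Rate is about 1%, can be sketched for a single score. The sketch uses the estimate given on the following slide, FPR = (# of decoys) x 2 / total matches. The function names are hypothetical, the input is a list of (score, is_decoy) pairs, and tied scores are not treated specially.

```python
def estimated_fpr(n_decoy, n_total):
    """FPR estimate from decoy hits: 2 x decoys / total matches."""
    return 2 * n_decoy / n_total if n_total else 0.0


def threshold_for_fpr(psms, max_fpr=0.01):
    """Lowest score cutoff whose accepted set has estimated FPR <= max_fpr.

    psms: iterable of (score, is_decoy) pairs. Returns None if no cutoff works.
    """
    best = None
    n_total = n_decoy = 0
    for score, is_decoy in sorted(psms, key=lambda p: p[0], reverse=True):
        n_total += 1
        n_decoy += int(is_decoy)
        if estimated_fpr(n_decoy, n_total) <= max_fpr:
            best = score
    return best


# Hypothetical matches: 200 high-scoring forward hits, then decoys appear.
psms = [(s, False) for s in range(801, 1001)]
psms += [(800, True), (799, True), (798, False), (797, False)]
print(threshold_for_fpr(psms))  # 800: one decoy in 201 matches is just under 1%
```

In practice several criteria (XCorr, dCn, mass accuracy, trypticity) are adjusted together; the single-score version only shows the principle.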

Decoy Sequences? A Reversed Database!
We generate decoy sequences by reversing each protein sequence in a given database, such that the resultant in silico digest contains nonsense peptides, then append the reversed database to the end of the forward database.
Decoy references are labeled with #.
Database searching with SEQUEST occurs from top to bottom.
A match to a decoy reference is known to be wrong, and a wrong match is equally likely to have mapped to a non-decoy sequence, so each decoy hit implies about one undetected wrong forward hit.
So our FPR is (# of decoys) x 2 / total matches.
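
A minimal sketch of the composite database and the FPR estimate described above. The function names, the dictionary representation, and the toy sequence are assumptions made for illustration; only the reversal, the # label, and the 2 x decoys / total formula come from the slide.

```python
def make_composite(forward):
    """Append a reversed, '#'-labeled copy of each protein after the forward set.

    forward: dict mapping protein reference -> sequence.
    """
    composite = dict(forward)                  # forward entries first
    for ref, seq in forward.items():
        composite["#" + ref] = seq[::-1]       # decoys appended at the end
    return composite


def decoy_fpr(matched_refs):
    """FPR = (# of decoys) x 2 / total matches."""
    if not matched_refs:
        return 0.0
    n_decoy = sum(ref.startswith("#") for ref in matched_refs)
    return 2 * n_decoy / len(matched_refs)


db = make_composite({"prot1": "MAGFASHTRP"})   # toy sequence
print(db)

# Hypothetical filtered result list: 2 decoy hits among 100 matches.
matches = ["prot1"] * 98 + ["#prot1"] * 2
print(decoy_fpr(matches))  # 2 x 2 / 100 = 0.04
```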

Target/Decoy Database Searching
Forward database (1. MAGFA SHTRP) + reversed database (1. PRTHS AFGAM) = composite database
SEQUEST
  Right matches: forward (F), 100%
  Wrong (random) matches: forward or reversed (F, R), 50% each
Filter (scoring, mass accuracy, etc.)
Generate final list
Estimate FP rate from 2 x Rev (i.e., 4%): reversed hits are the known FP, the matching share of forward hits is the unknown FP

SEQUEST scores: finding true positives
[Figure: dCn vs. XCorr for forward sequences and for forward + reverse sequences; PSM number vs. XCorr for TP and FP.]

Mass Accuracy in Proteomics: performance is related to the width of the distribution, not the average error
Precision of mass errors between observed and actual m/z, LTQ Orbitrap & LTQ FT:
  0.1 +/- 0.4 ppm: LTQ FT (SIM), AGC target 50,000 to avoid space-charge effects. Olsen et al. (2004) Mol. Cell. Proteomics 3, 608
  -0.2 +/- 1.0 ppm: High Mass Accuracy. Haas et al. (2006) Mol. Cell. Proteomics 5, 1326
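
The point about distribution width can be illustrated with a short sketch: mass errors are expressed in ppm, and a tolerance window is centered on the observed typical error rather than on zero. The function names, the use of the median as the window center, and the example errors are assumptions for illustration.

```python
from statistics import median


def ppm_error(observed_mz, theoretical_mz):
    """Signed mass error in parts per million."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6


def within_window(errors_ppm, half_width_ppm):
    """Keep errors within half_width_ppm of the median error."""
    center = median(errors_ppm)
    return [e for e in errors_ppm if abs(e - center) <= half_width_ppm]


print(ppm_error(500.0005, 500.0))  # about 1 ppm

# Hypothetical errors: a systematic offset near +3 ppm, narrow spread, one outlier.
errors = [2.6, 3.1, 2.9, 3.4, 3.0, 25.0]
print(within_window(errors, 1.0))            # the five tightly grouped values
print([e for e in errors if abs(e) <= 1.0])  # a window fixed at 0 keeps none
```

A constant offset costs nothing once the window is recentered, whereas a wide spread forces a wide window and lets more false positives through.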

MMA (mass measurement accuracy): True Positives and False Positives
[Figure: PSM number vs. MMA, centered at 0, for true positives and false positives.]
False positives are distributed evenly across MMA space.

MS/MS vs MMA: Precision vs Sensitivity
MS/MS criteria are strong precision filters; they require TP / FP separation for sensitivity.
MMA criteria are weak precision filters; they assist MS/MS criteria in improving sensitivity.

Distracting Wrong from Right: MMA
[Figure: true positives and false positives vs. MMA, centered at 0, for the extended search space and after the search space is filtered.]

Mass Accuracy: Another dimension of selectivity
[Figure: dCn vs. XCorr panels: forward sequences; forward + reverse; tryptic search, +/- 2 Da; tryptic search, +/- 2 Da with 5 ppm filter.]

Distracting Wrong from Right: Trypticity
[Figure: true positives and false positives by peptide termini.
Tryptic search: K/R-Peptide-K/R.
Partial enzyme search: one terminus may be any residue (A, G, C, S, T, I, L, F, P, M, V, H, D, E, Y, W, Q, N) while the other is K/R; these matches are filtered.]
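
The trypticity filter can be sketched as counting how many ends of a matched peptide are consistent with trypsin. The function names are hypothetical, and the use of "-" to mark a protein terminus among the flanking residues is an assumption about the input format.

```python
def tryptic_termini(prev_residue, peptide, next_residue):
    """Count tryptic ends (0, 1, or 2) of a peptide given its flanking residues.

    Assumption: '-' as a flanking residue marks a protein N- or C-terminus.
    """
    n_term_ok = prev_residue in ("K", "R", "-")
    c_term_ok = peptide[-1] in "KR" or next_residue == "-"
    return int(n_term_ok) + int(c_term_ok)


def fully_tryptic(matches):
    """Keep only matches with two tryptic termini."""
    return [m for m in matches if tryptic_termini(*m) == 2]


# Hypothetical matches from a partial enzyme search: (previous, peptide, next).
matches = [
    ("K", "WEFGGHTVLR", "A"),   # both ends tryptic
    ("A", "WEFGGHTVLR", "A"),   # only the C-terminal end
    ("A", "WEFGGHTVL", "S"),    # neither end
]
print(fully_tryptic(matches))
```

Searching with partial enzyme specificity enlarges the search space, so random matches spread over many non-tryptic peptides; keeping only fully tryptic matches afterwards discards most of them.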

What do we have here, hm?
[Figure: dCn (0 to 1) vs. XCorr (0 to 8) for phosphorylated and unphosphorylated peptides (n = 286), with reversed hits.]

Phosphopeptides: Chemically disadvantaged
Dataset of phosphorylated and unphosphorylated peptide MS/MS pairs (n = 286): singly phosphorylated (n = 207), doubly phosphorylated (n = 79). Example peptide: MSFEILR with a phosphate (P).
[Figure: XCorr (phosphorylated) vs. XCorr (unphosphorylated), axes 0 to 8; dCn (phosphorylated) vs. dCn (unphosphorylated).]

Phosphopeptides: Less power in XCorr & dCn
[Figure: XCorr (Ph/UnPh) and dCn (Ph/UnPh) ratios, axis 0 to 2, for singly and doubly phosphorylated peptides relative to unphosphorylated. Values shown: 86%, 93%.]

Mass Accuracy: Can it help for phosphorylation?
Workflow: yeast whole-cell lysate -> reduction, alkylation -> SDS-PAGE (60-80 kDa) -> trypsin -> IMAC purification

Mass Accuracy: Rescuing phosphopeptides
SEQUEST partial enzyme search, fully tryptic peptide spectral matches.
[Figure: XCorr panels for LTQ TOP10 and Orbitrap TOP10 (n=1390, n=1311), with MMA (ppm) from -50 to 50. Per-charge-state values shown: +2: 1.3, +3: 2.3 and +2: 2.7, +3: 3.5.]

Mission: Phosphopeptide rescue accomplished!
[Figure: # of phosphopeptides for LTQ and Orbitrap, with No MMA and MMA filtering. Values shown: 715 (1.0% FP), 600 (1.0% FP), 1046 (0.4% FP); 74% increase (600 to 1046).]

Search algorithms & phosphorylation
[Figure: SEQUEST vs. OMSSA; values shown: 936, 928, 98.]
Bakalarski et al., Anal. Bioanal. Chem., 2007

Phosphorylation site localization
GFDSNQpTWR or GFDpSNQTWR?
Beausoleil et al., Nat. Biotechnol., 2006
Taus et al., JPR, 2011

Phosphorylation false localization rate (FLR)
Chalkley & Clauser, MCP, 2012; Baker et al., MCP, 2011
Use non-native phosphoacceptors as decoys:
  Ser + Thr (human proteome): 14.1%
  Pro + Glu (human proteome): 14.5%
Allow search engine / localization assessment tools to consider pP and pE as true negative decoys.
Calculate dataset FLR based on the frequency of pP + pE decoys.
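
One possible reading of the FLR calculation is sketched below. The formula is an assumption, not the method of the cited papers: it supposes that wrong localizations fall on Ser/Thr and on Pro/Glu in proportion to how common those residues are, so that the count of pP + pE decoy sites, scaled by 14.1 / 14.5, estimates the number of wrong pS/pT sites. The function and constant names are hypothetical.

```python
FREQ_ST = 0.141  # Ser + Thr share of the human proteome (from the slide)
FREQ_PE = 0.145  # Pro + Glu share of the human proteome (from the slide)


def estimate_flr(n_st_sites, n_pe_sites):
    """Estimated false localization rate among reported pS/pT sites.

    ASSUMPTION: wrong localizations land on S/T and on P/E in proportion
    to residue frequency, so decoy (pP/pE) counts scale to S/T errors.
    """
    if n_st_sites == 0:
        return 0.0
    expected_wrong_st = n_pe_sites * FREQ_ST / FREQ_PE
    return expected_wrong_st / n_st_sites


# Hypothetical dataset: 1000 sites localized to S/T, 29 to the P/E decoys.
print(f"{estimate_flr(1000, 29):.4f}")  # about 0.0282
```

Because Pro + Glu and Ser + Thr are almost equally common, the scaling factor is close to 1 and the decoy count is nearly a direct estimate of the wrong S/T localizations.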