Protein Sequence Databases, Peptides to Proteins, and Statistical Significance
![Page 1: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance](https://reader035.fdocuments.in/reader035/viewer/2022062810/56815a8a550346895dc7ffa4/html5/thumbnails/1.jpg)
Protein Sequence Databases, Peptides to Proteins, and
Statistical Significance
Nathan Edwards
Department of Biochemistry and Mol. & Cell. Biology
Georgetown University Medical Center
![Page 2: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance](https://reader035.fdocuments.in/reader035/viewer/2022062810/56815a8a550346895dc7ffa4/html5/thumbnails/2.jpg)
2
Protein Sequence Databases
• Link between mass spectra and proteins
• A protein’s amino-acid sequence provides a basis for interpreting
  • Enzymatic digestion
  • Separation protocols
  • Fragmentation
  • Peptide ion masses
• We must interpret database information as carefully as mass spectra.
![Page 3: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance](https://reader035.fdocuments.in/reader035/viewer/2022062810/56815a8a550346895dc7ffa4/html5/thumbnails/3.jpg)
3
More than sequence…
Protein sequence databases provide much more than sequence:
• Names
• Descriptions
• Facts
• Predictions
• Links to other information sources
Protein databases provide a link to the current state of our understanding about a protein.
![Page 4: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance](https://reader035.fdocuments.in/reader035/viewer/2022062810/56815a8a550346895dc7ffa4/html5/thumbnails/4.jpg)
4
Much more than sequence
Names
• Accession, Name, Description
Biological Source
• Organism, Source, Taxonomy
Literature
Function
• Biological process, molecular function, cellular component
• Known and predicted
Features
• Polymorphism, Isoforms, PTMs, Domains
Derived Data
• Molecular weight, pI
![Page 5: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance](https://reader035.fdocuments.in/reader035/viewer/2022062810/56815a8a550346895dc7ffa4/html5/thumbnails/5.jpg)
5
Database types
Curated
• Swiss-Prot
• UniProt
• RefSeq NP
Translated
• TrEMBL
• RefSeq XP, ZP
Omnibus
• NCBI’s nr
• MSDB
• IPI
Other
• PDB
• HPRD
• EST
• Genomic
![Page 6: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance](https://reader035.fdocuments.in/reader035/viewer/2022062810/56815a8a550346895dc7ffa4/html5/thumbnails/6.jpg)
6
SwissProt
• From ExPASy
  • Expert Protein Analysis System
  • Swiss Institute of Bioinformatics
• ~ 515,000 protein sequence “entries”
• ~ 12,000 species represented
• ~ 20,000 Human proteins
• Highly curated
• Minimal redundancy
• Part of UniProt Consortium
![Page 7: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance](https://reader035.fdocuments.in/reader035/viewer/2022062810/56815a8a550346895dc7ffa4/html5/thumbnails/7.jpg)
7
TrEMBL
• Translated EMBL nucleotide sequences
  • European Molecular Biology Laboratory
  • European Bioinformatics Institute (EBI)
• Computer annotated
• Only sequences absent from SwissProt
• ~ 10.5 M protein sequence “entries”
• ~ 230,000 species
• ~ 75,000 Human proteins
• Part of UniProt Consortium
![Page 8: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance](https://reader035.fdocuments.in/reader035/viewer/2022062810/56815a8a550346895dc7ffa4/html5/thumbnails/8.jpg)
8
UniProt
• Universal Protein Resource
• Combination of sequences from
  • Swiss-Prot
  • TrEMBL
• Mixture of highly curated/reviewed (SwissProt) and computer annotation (TrEMBL)
• “Similar sequence” clusters are available
  • 50%, 90%, 100% sequence similarity
![Page 9: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance](https://reader035.fdocuments.in/reader035/viewer/2022062810/56815a8a550346895dc7ffa4/html5/thumbnails/9.jpg)
9
RefSeq
• Reference Sequence
• From NCBI (National Center for Biotechnology Information), NLM, NIH
• Integrated genomic, transcript, and protein sequences.
• Varying levels of curation
  • Reviewed, Validated, …, Predicted, …
• ~ 9.7 M protein sequence “entries”
  • ~ 209,000 reviewed, ~ 90,000 validated
• ~ 39,000 Human proteins
![Page 10: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance](https://reader035.fdocuments.in/reader035/viewer/2022062810/56815a8a550346895dc7ffa4/html5/thumbnails/10.jpg)
10
RefSeq
• Particular focus on major research organisms
  • Tightly integrated with genome projects.
• Curated entries: NP accessions
• Predicted entries: XP accessions
• Others: YP, ZP, AP
![Page 11: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance](https://reader035.fdocuments.in/reader035/viewer/2022062810/56815a8a550346895dc7ffa4/html5/thumbnails/11.jpg)
11
IPI
• International Protein Index
• From EBI
• For a specific species, combines
  • UniProt, RefSeq, Ensembl
  • Species specific databases: HInv-DB, VEGA, TAIR
• ~ 87,000 (from ~ 307,000) human protein sequence entries
• Human, mouse, rat, zebrafish, Arabidopsis, chicken, cow
• Slated for closure November 2010, but still going…
![Page 12: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance](https://reader035.fdocuments.in/reader035/viewer/2022062810/56815a8a550346895dc7ffa4/html5/thumbnails/12.jpg)
12
MSDB
• From the Imperial College (London)
• Combines
  • PIR, TrEMBL, GenBank, SwissProt
• Distributed with Mascot
  • …so well integrated with Mascot
• ~ 3.2M protein sequence entries
• “Similar sequences” suppressed
  • 100% sequence similarity
• Not updated since September 2006 (obsolete)
![Page 13: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance](https://reader035.fdocuments.in/reader035/viewer/2022062810/56815a8a550346895dc7ffa4/html5/thumbnails/13.jpg)
13
NCBI’s nr
• “non-redundant”
• Contains
  • GenBank CDS translations
  • RefSeq Proteins
  • Protein Data Bank (PDB)
  • SwissProt, TrEMBL, PIR
  • Others
• “Similar sequences” suppressed
  • 100% sequence similarity
• ~ 10.5 M protein sequence “entries”
![Page 14: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance](https://reader035.fdocuments.in/reader035/viewer/2022062810/56815a8a550346895dc7ffa4/html5/thumbnails/14.jpg)
14
Human Sequences
• Number of Human genes is believed to be between 20,000 and 25,000
SwissProt ~ 20,000
RefSeq ~ 39,000
TrEMBL ~ 75,000
IPI-HUMAN ~ 87,000
MSDB ~130,000
nr ~230,000
![Page 15: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance](https://reader035.fdocuments.in/reader035/viewer/2022062810/56815a8a550346895dc7ffa4/html5/thumbnails/15.jpg)
15
DNA to Protein Sequence
Derived from http://online.itp.ucsb.edu/online/infobio01/burge
![Page 16: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance](https://reader035.fdocuments.in/reader035/viewer/2022062810/56815a8a550346895dc7ffa4/html5/thumbnails/16.jpg)
16
UCSC Genome Browser
• Shows many sources of protein sequence evidence in a unified display
![Page 17: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance](https://reader035.fdocuments.in/reader035/viewer/2022062810/56815a8a550346895dc7ffa4/html5/thumbnails/17.jpg)
17
Accessions
• Permanent labels
• Short, machine readable
• Enable precise communication
• Typos render them unusable!
• Each database uses a different format
  • Swiss-Prot: P17947
  • Ensembl: ENSG00000066336
  • PIR: S60367; S60367
  • GO: GO:0003700;
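Because each database uses its own accession format, a typo can often be caught by checking the string against a per-database pattern. The sketch below is illustrative only: the regular expressions are assumptions modeled on the example accessions above (the UniProtKB pattern follows the commonly published form), and `classify_accession` is a hypothetical helper, not part of any database's tooling.

```python
import re

# Illustrative patterns only (assumptions, not authoritative definitions).
ACCESSION_PATTERNS = {
    "UniProtKB": re.compile(
        r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
    ),
    "Ensembl gene": re.compile(r"ENSG[0-9]{11}"),
    "GO term": re.compile(r"GO:[0-9]{7}"),
}


def classify_accession(text):
    """Return the names of all patterns that match the whole string."""
    cleaned = text.strip().rstrip(";")  # tolerate a trailing separator
    return [
        name
        for name, pattern in ACCESSION_PATTERNS.items()
        if pattern.fullmatch(cleaned)
    ]


if __name__ == "__main__":
    for candidate in ["P17947", "ENSG00000066336", "GO:0003700;", "P1794O"]:
        print(candidate, classify_accession(candidate))
```

The last candidate ends in the letter O rather than a zero, so it matches no pattern, which is the kind of typo that makes an accession unusable.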
![Page 18: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance](https://reader035.fdocuments.in/reader035/viewer/2022062810/56815a8a550346895dc7ffa4/html5/thumbnails/18.jpg)
18
Names / IDs
• Compact mnemonic labels
• Not guaranteed permanent
• Require careful curation
• Conceptual objects
• ALBU_HUMAN
  • Serum Albumin
• RT30_HUMAN
  • Mitochondrial 28S ribosomal protein S30
• CP3A7_HUMAN
  • Cytochrome P450 3A7
![Page 19: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance](https://reader035.fdocuments.in/reader035/viewer/2022062810/56815a8a550346895dc7ffa4/html5/thumbnails/19.jpg)
19
Description / Name
• Free text description
• Human readable
• Space limited
• Hard for computers to interpret!
• No standard nomenclature or format
• Often abused…
• COX7R_HUMAN
  • Cytochrome c oxidase subunit VIIa-related protein, mitochondrial [Precursor]
![Page 20: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance](https://reader035.fdocuments.in/reader035/viewer/2022062810/56815a8a550346895dc7ffa4/html5/thumbnails/20.jpg)
20
FASTA Format
• >
• Accession number
  • No uniform format
  • Multiple accessions separated by |
• One line of description
  • Usually pretty cryptic
• Organism of sequence?
  • No uniform format
  • Official Latin name not necessarily used
• Amino-acid sequence in single-letter code
  • Usually spread over multiple lines.
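The layout described above can be read with a few lines of code. This is a minimal sketch, assuming the identifier field ends at the first space and that `|` separates its parts; `read_fasta` and `split_header` are hypothetical names, and the sequence in the usage example is a made-up placeholder, not the real protein sequence.

```python
def split_header(header):
    """Split a header (without '>') into (accession fields, description)."""
    ident, _, description = header.partition(" ")
    accessions = [field for field in ident.split("|") if field]
    return accessions, description


def read_fasta(lines):
    """Yield (accessions, description, sequence) for each FASTA record."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield split_header(header) + ("".join(chunks),)
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence is usually spread over multiple lines; join them.
            chunks.append(line)
    if header is not None:
        yield split_header(header) + ("".join(chunks),)


if __name__ == "__main__":
    example = [
        ">gi|51316094|sp|Q9H0A8|COM4_HUMAN COMM domain containing protein 4",
        "MKTAYIAKQR",  # placeholder residues, for illustration only
        "QISFVKSHFS",
    ]
    for record in read_fasta(example):
        print(record)
```

Because there is no uniform header format, the split on the first space and on `|` will not suit every database; organism names in particular have to be recovered from the free-text description case by case.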
![Page 21: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance](https://reader035.fdocuments.in/reader035/viewer/2022062810/56815a8a550346895dc7ffa4/html5/thumbnails/21.jpg)
21
FASTA Format
![Page 22: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance](https://reader035.fdocuments.in/reader035/viewer/2022062810/56815a8a550346895dc7ffa4/html5/thumbnails/22.jpg)
22
Organism / Species / Taxonomy
• The protein’s organism…
  • …or the source of the biological sample
• The most reliable sequence annotation available
  • Useful only to the extent that it is correct
• NCBI’s taxonomy is widely used
  • Provides a standard of sorts; hierarchical
  • Other databases don’t necessarily keep up
• Organism-specific sequence databases are starting to become available.
![Page 23: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance](https://reader035.fdocuments.in/reader035/viewer/2022062810/56815a8a550346895dc7ffa4/html5/thumbnails/23.jpg)
23
Organism / Species / Taxonomy
• Buffalo rat
• Gunn rats
• Norway rat
• Rattus PC12 clone IS
• Rattus norvegicus
• Rattus norvegicus8
• Rattus norwegicus
• Rattus rattiscus
• Rattus sp.
• Rattus sp. strain Wistar
• Sprague-Dawley rat
• Wistar rats
• brown rat
• laboratory rat
• rat
• rats
• zitter rats
![Page 24: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance](https://reader035.fdocuments.in/reader035/viewer/2022062810/56815a8a550346895dc7ffa4/html5/thumbnails/24.jpg)
24
Controlled Vocabulary
• Middle ground between computers and people
• Provides precision for concepts
  • Searching, sorting, browsing
  • Concept relationships
• Vocabulary / Ontology must be established
  • Human curation
• Link between concept and object:
  • Manually curated
  • Automatic / Predicted
![Page 25: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance](https://reader035.fdocuments.in/reader035/viewer/2022062810/56815a8a550346895dc7ffa4/html5/thumbnails/25.jpg)
25
Gene Ontology
• Hierarchical
  • Molecular function
  • Biological process
  • Cellular component
• Describes the vocabulary only!
  • Protein families provide GO association
• Not necessarily any appropriate GO category.
• Not necessarily in all three hierarchies.
• Sometimes general categories are used because none of the specific categories are correct.
![Page 26: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance](https://reader035.fdocuments.in/reader035/viewer/2022062810/56815a8a550346895dc7ffa4/html5/thumbnails/26.jpg)
26
Gene Ontology
![Page 27: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance](https://reader035.fdocuments.in/reader035/viewer/2022062810/56815a8a550346895dc7ffa4/html5/thumbnails/27.jpg)
27
Protein Families
• Similar sequence implies similar function
• Similar structure implies similar function
• Common domains imply similar function
• Bootstrap up from small sets of proteins/domains with well understood characteristics
• Usually a hybrid manual / automatic approach
![Page 28: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance](https://reader035.fdocuments.in/reader035/viewer/2022062810/56815a8a550346895dc7ffa4/html5/thumbnails/28.jpg)
28
Protein Families
![Page 29: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance](https://reader035.fdocuments.in/reader035/viewer/2022062810/56815a8a550346895dc7ffa4/html5/thumbnails/29.jpg)
29
Protein Families
![Page 30: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance](https://reader035.fdocuments.in/reader035/viewer/2022062810/56815a8a550346895dc7ffa4/html5/thumbnails/30.jpg)
30
Sequence Variants
• Protein sequence can vary due to
  • Polymorphism
  • Alternative splicing
  • Post-translational modification
• Sequence databases typically do not capture all versions of a protein’s sequence
![Page 31: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance](https://reader035.fdocuments.in/reader035/viewer/2022062810/56815a8a550346895dc7ffa4/html5/thumbnails/31.jpg)
31
Swiss-Prot Variant Annotations
![Page 32: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance](https://reader035.fdocuments.in/reader035/viewer/2022062810/56815a8a550346895dc7ffa4/html5/thumbnails/32.jpg)
32
Swiss-Prot Variant Annotations
![Page 33: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance](https://reader035.fdocuments.in/reader035/viewer/2022062810/56815a8a550346895dc7ffa4/html5/thumbnails/33.jpg)
33
Omnibus Database Redundancy Elimination
• Source databases often contain the same sequences with different descriptions
• Omnibus databases keep one copy of the sequence, and
  • An arbitrary description, or
  • All descriptions, or
  • Particular description, based on source preference
• Good definitions can be lost, including taxonomy
![Page 34: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance](https://reader035.fdocuments.in/reader035/viewer/2022062810/56815a8a550346895dc7ffa4/html5/thumbnails/34.jpg)
34
Description Elimination
• gi|12053249|emb|CAB66806.1| hypothetical protein [Homo sapiens]
• gi|46255828|gb|AAH68998.1| COMMD4 protein [Homo sapiens]
• gi|42632621|gb|AAS22242.1| COMMD4 [Homo sapiens]
• gi|21361661|ref|NP_060298.2| COMM domain containing 4 [Homo sapiens]
• gi|51316094|sp|Q9H0A8|COM4_HUMAN COMM domain containing protein 4
• gi|49065330|emb|CAG38483.1| COMMD4 [Homo sapiens]
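The headers above all describe the same sequence, so an omnibus database has to decide which description survives. The sketch below shows two of the options: keep all descriptions, or keep one by source preference. The ranking in `SOURCE_RANK` is an assumption for illustration, the function names are hypothetical, and the sequence is a placeholder.

```python
from collections import defaultdict

# Hypothetical source preference: lower rank wins. The tags are the
# database fields seen in the headers above (sp, ref, gb, emb).
SOURCE_RANK = {"sp": 0, "ref": 1, "gb": 2, "emb": 3}


def merge_identical(records):
    """Group (header, sequence) pairs by sequence, keeping every header."""
    by_sequence = defaultdict(list)
    for header, sequence in records:
        by_sequence[sequence].append(header)
    return dict(by_sequence)


def preferred_header(headers):
    """Pick one header by source preference; ties keep input order."""
    def rank(header):
        fields = header.split("|")
        source = fields[2] if len(fields) > 2 else ""
        return SOURCE_RANK.get(source, len(SOURCE_RANK))
    return min(headers, key=rank)


if __name__ == "__main__":
    seq = "MKTAYIAKQR"  # placeholder, not the real COMMD4 sequence
    records = [
        ("gi|12053249|emb|CAB66806.1| hypothetical protein [Homo sapiens]", seq),
        ("gi|46255828|gb|AAH68998.1| COMMD4 protein [Homo sapiens]", seq),
        ("gi|21361661|ref|NP_060298.2| COMM domain containing 4 [Homo sapiens]", seq),
        ("gi|51316094|sp|Q9H0A8|COM4_HUMAN COMM domain containing protein 4", seq),
    ]
    merged = merge_identical(records)
    print(len(merged[seq]), "descriptions for one sequence")
    print(preferred_header(merged[seq]))
```

Note that the preferred header here carries no organism in brackets, while an arbitrary choice could have kept "hypothetical protein": either way, useful annotation can be lost.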
![Page 35: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance](https://reader035.fdocuments.in/reader035/viewer/2022062810/56815a8a550346895dc7ffa4/html5/thumbnails/35.jpg)
35
Peptides to Proteins
Nesvizhskii et al., Anal. Chem. 2003
![Page 36: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance](https://reader035.fdocuments.in/reader035/viewer/2022062810/56815a8a550346895dc7ffa4/html5/thumbnails/36.jpg)
36
Peptides to Proteins
![Page 37: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance](https://reader035.fdocuments.in/reader035/viewer/2022062810/56815a8a550346895dc7ffa4/html5/thumbnails/37.jpg)
37
Peptides to Proteins
• A peptide sequence may occur in many different protein sequences
  • Variants, paralogues, protein families
• Separation, digestion and ionization are not well understood
• Proteins in a sequence database are far from random, and far from independent of one another
![Page 38: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance](https://reader035.fdocuments.in/reader035/viewer/2022062810/56815a8a550346895dc7ffa4/html5/thumbnails/38.jpg)
38
Indistinguishable Protein Sequences
Nesvizhskii, Aebersold, Mol Cell Proteomics, 2005
![Page 39: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance](https://reader035.fdocuments.in/reader035/viewer/2022062810/56815a8a550346895dc7ffa4/html5/thumbnails/39.jpg)
39
Indistinguishable Protein Sequences
Nesvizhskii, Aebersold, Mol Cell Proteomics, 2005
![Page 40: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance](https://reader035.fdocuments.in/reader035/viewer/2022062810/56815a8a550346895dc7ffa4/html5/thumbnails/40.jpg)
40
Protein Families
Nesvizhskii, Aebersold, Mol Cell Proteomics, 2005
![Page 41: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance](https://reader035.fdocuments.in/reader035/viewer/2022062810/56815a8a550346895dc7ffa4/html5/thumbnails/41.jpg)
41
Protein Grouping Scenarios
• Parsimony
  • Minimum # of proteins
• Weighted
  • Choose proteins with the most confident peptides (ProteinProphet)
• Show all
  • Mark repeated peptides
• Often no (ideal) resolution is possible!
Nesvizhskii, Aebersold, Mol Cell Proteomics, 2005
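The parsimony scenario can be sketched as a set-cover problem: find a small set of proteins that explains every identified peptide. The code below uses a greedy heuristic, which is an assumption made for brevity; it is not guaranteed to find the true minimum and it is not how any particular tool is implemented. The protein and peptide names are made up.

```python
def greedy_parsimony(protein_to_peptides):
    """Greedy set cover over identified peptides.

    protein_to_peptides maps a protein name to a set of peptide strings.
    Repeatedly pick the protein that explains the most still-unexplained
    peptides; ties are broken alphabetically by protein name.
    """
    remaining = set().union(*protein_to_peptides.values())
    chosen = []
    while remaining:
        best = max(
            sorted(protein_to_peptides),
            key=lambda name: len(protein_to_peptides[name] & remaining),
        )
        chosen.append(best)
        remaining -= protein_to_peptides[best]
    return chosen


if __name__ == "__main__":
    evidence = {
        "A": {"pep1", "pep2", "pep3"},
        "B": {"pep2", "pep3"},  # subsumed by A
        "C": {"pep4"},          # indistinguishable from D
        "D": {"pep4"},
    }
    print(greedy_parsimony(evidence))
```

Proteins C and D are supported by exactly the same peptide, so the choice between them is made by the tie-break alone. That is the situation in which no ideal resolution exists, and reporting them as a group is more honest than picking one.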
![Page 42: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance](https://reader035.fdocuments.in/reader035/viewer/2022062810/56815a8a550346895dc7ffa4/html5/thumbnails/42.jpg)
42
High Quality Peptide Identification: E-value < 10^-8
![Page 43: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance](https://reader035.fdocuments.in/reader035/viewer/2022062810/56815a8a550346895dc7ffa4/html5/thumbnails/43.jpg)
43
Moderate quality peptide identification: E-value < 10^-3
![Page 44: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance](https://reader035.fdocuments.in/reader035/viewer/2022062810/56815a8a550346895dc7ffa4/html5/thumbnails/44.jpg)
44
Peptide Identification
• Peptide fragmentation by CID is poorly understood
• MS/MS spectra represent incomplete information about amino-acid sequence
  • I/L, K/Q, GG/N, …
• Correct identifications don’t come with a certificate!
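The ambiguous pairs listed above follow directly from residue masses. A small sketch, using standard monoisotopic residue masses rounded to five decimals (only the residues needed here are included; the helper names are hypothetical):

```python
# Monoisotopic residue masses in daltons (standard values, rounded).
RESIDUE_MASS = {
    "G": 57.02146,
    "N": 114.04293,
    "I": 113.08406,
    "L": 113.08406,
    "K": 128.09496,
    "Q": 128.05858,
}


def residue_mass(peptide):
    """Sum of residue masses for a peptide fragment."""
    return sum(RESIDUE_MASS[aa] for aa in peptide)


def mass_difference(a, b):
    """Absolute mass difference between two residue strings."""
    return abs(residue_mass(a) - residue_mass(b))


if __name__ == "__main__":
    for a, b in [("I", "L"), ("K", "Q"), ("GG", "N")]:
        print(f"{a} vs {b}: {mass_difference(a, b):.5f} Da")
```

I and L are exactly isobaric, GG and N have the same elemental composition and so the same mass up to rounding, and K and Q differ by only about 0.036 Da, which cannot be resolved without sufficient fragment mass accuracy.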
![Page 45: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance](https://reader035.fdocuments.in/reader035/viewer/2022062810/56815a8a550346895dc7ffa4/html5/thumbnails/45.jpg)
45
Peptide Identification
• High-throughput workflows demand we analyze all spectra, all the time.
• Spectra may not contain enough information to be interpreted correctly
  • …bad static on a cell phone
• Peptides may not match our assumptions
  • …it’s all Greek to me
• “Don’t know” is an acceptable answer!
![Page 46: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance](https://reader035.fdocuments.in/reader035/viewer/2022062810/56815a8a550346895dc7ffa4/html5/thumbnails/46.jpg)
46
What scores do “wrong” peptides get?
• Generate random peptide sequences
  • Real looking fragment masses
  • Empirical distribution
  • Require similar precursor mass
• Arbitrary score function can model anything we like!
![Page 47: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance](https://reader035.fdocuments.in/reader035/viewer/2022062810/56815a8a550346895dc7ffa4/html5/thumbnails/47.jpg)
47
Random Peptide Scores
Fenyo & Beavis, Anal. Chem., 2003
![Page 48: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance](https://reader035.fdocuments.in/reader035/viewer/2022062810/56815a8a550346895dc7ffa4/html5/thumbnails/48.jpg)
48
Random Peptide Scores
Fenyo & Beavis, Anal. Chem., 2003
![Page 49: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance](https://reader035.fdocuments.in/reader035/viewer/2022062810/56815a8a550346895dc7ffa4/html5/thumbnails/49.jpg)
49
Random Peptide Scores
• Truly random peptides don’t look much like real peptides
• Just use peptides from the sequence database!
• Assumptions:
  • IID sampling of “score” values per spectrum
• Caveats:
  • Correct peptide (non-random) may be included
  • Peptides are not independent
![Page 50: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance](https://reader035.fdocuments.in/reader035/viewer/2022062810/56815a8a550346895dc7ffa4/html5/thumbnails/50.jpg)
50
Extrapolating from the Empirical Distribution
• Often, the empirical shape is consistent with a theoretical model
Geer et al., J. Proteome Research, 2004
Fenyo & Beavis, Anal. Chem., 2003
![Page 51: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance](https://reader035.fdocuments.in/reader035/viewer/2022062810/56815a8a550346895dc7ffa4/html5/thumbnails/51.jpg)
E-values vs p-values
• Need to adjust for the size of the sequence database
  • Best false/random score goes up with number of trials
• E-value makes this adjustment
  • Expected number of incorrect peptides (with this score) from this sequence database.
• E-value = # Trials * p-value (to 1st approx.)
51
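The relationship between p-values and E-values can be made concrete with a few lines. This is a sketch under stated assumptions: the p-value is taken from an empirical sample of random-match scores, higher scores are better, and the number of trials is the number of candidate peptides compared against the spectrum. The function names and numbers are illustrative only.

```python
def empirical_p_value(score, random_scores):
    """Fraction of random-match scores at least as good as `score`."""
    hits = sum(1 for s in random_scores if s >= score)
    return hits / len(random_scores)


def e_value(p_value, n_trials):
    """First-order approximation: E-value = number of trials * p-value."""
    return n_trials * p_value


if __name__ == "__main__":
    random_scores = list(range(100))  # stand-in for scores of wrong peptides
    p = empirical_p_value(95, random_scores)
    print("p-value:", p)
    for trials in (10, 1000):
        print(trials, "candidates -> E-value", e_value(p, trials))
```

The same p-value gives a much larger E-value against a larger database, which is the adjustment described above. An empirical p-value can never be smaller than one over the sample size, which is one reason to extrapolate from a fitted theoretical model instead.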
![Page 52: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance](https://reader035.fdocuments.in/reader035/viewer/2022062810/56815a8a550346895dc7ffa4/html5/thumbnails/52.jpg)
52
False Discovery Rate
• Which peptide IDs to accept?
  • E-value only provides a per-spectrum statistic
  • With enough spectra, even these can be misleading!
• Decide which spectra (w/ scores) will be accepted:
  • SEQUEST Xcorr, E-value, Score, etc., plus...
  • Threshold on identification criteria
• Control the proportion of incorrect identifications in the result for the entire dataset
![Page 53: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance](https://reader035.fdocuments.in/reader035/viewer/2022062810/56815a8a550346895dc7ffa4/html5/thumbnails/53.jpg)
Distribution of scores over all spectra
53
[Figure: histogram; vertical axis ticks from 0 to 200, horizontal axis ticks from -3.9 to 7.3]
Brian Searle, Proteome Software
![Page 54: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance](https://reader035.fdocuments.in/reader035/viewer/2022062810/56815a8a550346895dc7ffa4/html5/thumbnails/54.jpg)
Distribution of scores over all spectra
54
[Figure: histogram with two labeled distributions, False and True; vertical axis ticks from 0 to 200, horizontal axis ticks from -3.9 to 7.3]
Brian Searle, Proteome Software
![Page 55: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance](https://reader035.fdocuments.in/reader035/viewer/2022062810/56815a8a550346895dc7ffa4/html5/thumbnails/55.jpg)
55
False Discovery Rate
• FDR(score ≥ x) = (# false ids with score ≥ x) / (# all ids with score ≥ x)
• Need to estimate numerator!
  • Assumes the false (and true) scores, sampled over spectra, are IID
  • Not true for some peptide-spectrum scores
  • (Mostly) true for E-values
• Can compute the # false ids using a decoy search…
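As a rough illustration of estimating the numerator, the sketch below assumes two separate searches of the same spectra, one against the real database and one against a decoy database of the same size, with higher scores being better. The count of decoy hits above the threshold stands in for the unknown number of false identifications. `estimate_fdr` and the scores are hypothetical.

```python
def estimate_fdr(target_scores, decoy_scores, threshold):
    """Estimate FDR at `threshold` from separate target and decoy searches.

    Numerator: decoy ids with score >= threshold (estimate of false ids).
    Denominator: all accepted target ids with score >= threshold.
    """
    accepted = sum(1 for s in target_scores if s >= threshold)
    false_estimate = sum(1 for s in decoy_scores if s >= threshold)
    if accepted == 0:
        return 0.0
    return min(1.0, false_estimate / accepted)


if __name__ == "__main__":
    target = [7.1, 6.4, 5.9, 5.2, 4.8, 3.9, 3.1, 2.5, 1.7, 0.9]
    decoy = [3.3, 2.9, 2.2, 1.8, 1.5, 1.1, 0.8, 0.6, 0.4, 0.1]
    for x in (1.0, 2.0, 3.0, 4.0):
        print(f"threshold {x}: estimated FDR {estimate_fdr(target, decoy, x):.2f}")
```

Raising the threshold lowers the estimated FDR but also discards identifications, so the threshold is usually chosen as the most permissive one that still meets the desired error rate.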
![Page 56: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance](https://reader035.fdocuments.in/reader035/viewer/2022062810/56815a8a550346895dc7ffa4/html5/thumbnails/56.jpg)
56
Peptide Prophet
Distribution of spectral scores in the results
Keller et al., Anal. Chem. 2002
![Page 57: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance](https://reader035.fdocuments.in/reader035/viewer/2022062810/56815a8a550346895dc7ffa4/html5/thumbnails/57.jpg)
57
Decoy searches
• Shuffle or reverse sequence database
  • Same size as original
  • Known false identifications
  • Estimate “False” distribution
• Alternatively, merge target+decoy results:
  • Competition between target and decoy scores
  • Assume false target and false decoys each win half the time
  • FDR(score ≥ x) = (2 * # decoy ids with score ≥ x) / (# target ids with score ≥ x)
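A minimal sketch of the merged target+decoy approach follows. It assumes decoys are made by reversing each protein sequence and are marked with a hypothetical `REV_` accession prefix, that each spectrum contributes one best hit, and that higher scores are better. The FDR function applies the formula exactly as written above; conventions differ between tools on the factor of two and on whether decoy hits are counted in the denominator.

```python
def reversed_decoys(proteins, prefix="REV_"):
    """Build a decoy database of the same size by reversing each sequence."""
    return {prefix + acc: seq[::-1] for acc, seq in proteins.items()}


def competition_fdr(best_hits, threshold, prefix="REV_"):
    """FDR estimate for a merged search, following the formula above:
    2 * (# decoy ids >= threshold) / (# target ids >= threshold).

    best_hits: one (accession, score) pair per spectrum.
    """
    decoys = sum(
        1 for acc, s in best_hits if s >= threshold and acc.startswith(prefix)
    )
    targets = sum(
        1 for acc, s in best_hits if s >= threshold and not acc.startswith(prefix)
    )
    if targets == 0:
        return 0.0
    return min(1.0, 2 * decoys / targets)


if __name__ == "__main__":
    database = {"P1": "MKTAY"}  # placeholder sequence
    print(reversed_decoys(database))
    hits = [("P1", 5.0), ("P2", 4.0), ("REV_P3", 4.5),
            ("P4", 3.0), ("P5", 1.0), ("P6", 4.2)]
    print(competition_fdr(hits, threshold=3.0))
```

Reversal keeps the decoy database the same size and amino-acid composition as the original, which is what makes decoy hits a usable stand-in for false target hits.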
![Page 58: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance](https://reader035.fdocuments.in/reader035/viewer/2022062810/56815a8a550346895dc7ffa4/html5/thumbnails/58.jpg)
Summary
• Protein sequence databases have varying characteristics; choose wisely!
• Inferring proteins from peptides can be (very) tricky!
• Statistical significance can help control the proportion of errors in the (peptide-level) results.
58