BITS - Overview of sequence databases for mass spectrometry data analysis
http://www.bits.vib.be/training
BITS MS Data Processing – Sequence Databases UGent, Gent, Belgium – 16 December 2011
Lennart Martens [email protected]
sequence databases
lennart martens
Computational Omics and Systems Biology Group
Department of Medical Protein Research, VIB Department of Biochemistry, Ghent University
Ghent, Belgium
PEPTIDES AND REDUNDANCY
IN SEQUENCE DATABASES
>Protein 1
LENNARTMARTENS
>Protein 2
LENNARTMARTENT

>Protein 1 (1-6)
LENNAR
>Protein 1 (7-10)
TMAR
>Protein 1 (11-14)
TENS
>Protein 2 (1-6)
LENNAR
>Protein 2 (7-10)
TMAR
>Protein 2 (11-14)
TENT

non-redundant protein DB ≠ non-redundant peptide DB
Database content: all peptide sequences in the database
Database information: number of unique peptide sequences
$$\text{database information ratio} = \frac{\text{database information}}{\text{database content}}$$
Peptide-level sequence redundancy
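A minimal sketch of how the information ratio could be computed for the two example proteins. It assumes a simplified tryptic rule (cleave after K or R unless the next residue is P) with no missed cleavages and no mass limits; the function names are illustrative and are not taken from DBToolkit.

```python
def tryptic_digest(sequence):
    """Cleave after K or R, except before P; no missed cleavages (simplified)."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        is_last = i == len(sequence) - 1
        if residue in "KR" and not is_last and sequence[i + 1] != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    peptides.append(sequence[start:])  # remaining C-terminal peptide
    return peptides


def information_ratio(proteins):
    """Return (database information, database content, ratio)."""
    content = [pep for seq in proteins for pep in tryptic_digest(seq)]
    information = set(content)  # unique peptide sequences
    return len(information), len(content), len(information) / len(content)


proteins = ["LENNARTMARTENS", "LENNARTMARTENT"]
unique, total, ratio = information_ratio(proteins)
print(unique, total, round(ratio, 2))  # 4 6 0.67
```

The two proteins are non-redundant at the protein level, yet only 4 of their 6 peptides are unique.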
Database                       Content      Information   Ratio
UniProtKB/Swiss-Prot human      1,584,806    1,466,927     93%
UniProtKB/TrEMBL human          3,186,806    1,309,625     41%
Ensembl human                   3,491,778    1,559,685     45%
IPI human                       4,472,356    1,877,500     42%
NCBI nr human                  10,307,319    2,394,844     23%
Content and information are numbers of peptide sequences; ratio = information / content.
Tryptic cleavage, 1 allowed missed cleavage, Mass limits from 600 to 4000 Da.
Information ratios for common databases
ENRICHING SEQUENCE DATABASES
[Figure: in vivo processing of a protein (N to C), followed by enzymatic digest and subsequent NH2-terminal peptide isolation]
The isolated NH2-terminal peptide is not in the sequence database!
Peptide + search base: ID miss
The influence of the sequence database
Mitochondrial isovaleryl-CoA dehydrogenase
N-terminal transit peptide (1-29); isovaleryl-CoA dehydrogenase (30-423)
MATATRLLGWRVASWRLRPPLAGFVSQRAHSLLPVDDAINGLSEEQRQLRE…
…LDGIQCFGGNGYINDFPMGRFLRDAKLYEIGAGTSEVRRLVIGRAFNADFH
Positions marked on the sequence: 30, 47, 423
An example
Search base (ID miss):
AHSLLPVDDAINGLSEEQR
Revised search base (ID):
AHSLLPVDDAINGLSEEQR
HSLLPVDDAINGLSEEQR
SLLPVDDAINGLSEEQR
LLPVDDAINGLSEEQR
LPVDDAINGLSEEQR
PVDDAINGLSEEQR
VDDAINGLSEEQR
……
Extending the information content
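The ladder of truncated peptides above can be generated mechanically. The sketch below shows N-terminal ragging as plain string slicing; the function name and the `min_length` cut-off are assumptions made for this illustration and are not DBToolkit parameters.

```python
def n_terminal_ragging(peptide, min_length=6):
    """Return the peptide plus every N-terminally truncated form
    that is still at least `min_length` residues long."""
    return [peptide[i:] for i in range(len(peptide) - min_length + 1)]


ragged = n_terminal_ragging("AHSLLPVDDAINGLSEEQR", min_length=13)
for peptide in ragged:
    print(peptide)
```

With `min_length=13` this reproduces the seven sequences listed on the slide, including HSLLPVDDAINGLSEEQR, the N-terminal peptide of the mature protein (30-47).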
[Figure: schematic of a protein (NH2 to COOH) with arginine (R) and aspartate (D) cleavage sites, before and after cleavage]
Caspase cleavage of this protein (for 50%)
NH2-terminal peptide isolation
The resulting NH2-terminal peptide is NOT IN DB!
Another example: in vivo protein cleavage
[Figure: peptide (NH2 to COOH) with one terminus the result of the in vivo protease and the other the result of in vitro trypsin]
Arg-C definition:
Title:Arg-C
Cleavage:R
Restrict:P
Cterm
Arg-C (N-term), Cathepsin (C-term) definition:
Title:dualArgC_Cathep
Cleavage:DXR
Restrict:P
Cterm
Creation of a bifunctional enzyme will generate the correct peptides!
Solving the issue: bifunctional enzymes
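To make the idea concrete, here is a rough sketch of what a bifunctional digest might produce. It rests on one plausible reading of the `DXR` definition, namely that a peptide starts right after a D and ends at the first following R; that reading, the function name, and the omission of the `Restrict:P` rule are all assumptions, and this is not the search engine's or DBToolkit's implementation.

```python
def dual_enzyme_peptides(sequence, start_after="D", end_at="R"):
    """Return peptides that begin right after a `start_after` residue and
    end at the first following `end_at` residue (inclusive).
    Simplified: the Restrict:P rule is ignored."""
    peptides = []
    for i, residue in enumerate(sequence):
        if residue in start_after:
            start = i + 1
            for j in range(start, len(sequence)):
                if sequence[j] in end_at:
                    peptides.append(sequence[start:j + 1])
                    break
    return peptides


# Toy sequence invented for the illustration
print(dual_enzyme_peptides("AAADGGGRTTTR"))  # ['GGGR']
```

A plain Arg-C digest of the same toy sequence would only give AAADGGGR and TTTR, so the peptide starting after the D would never be generated.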
DBTOOLKIT AND
DATABASE ON DEMAND
See: Martens et al., Bioinformatics 2005, 21(17): 3584-3585
http://genesis.UGent.be/dbtoolkit
Working with databases: DBToolkit
a) Enzymatic digestion using regular or ‘dual’ enzymes: proteins to peptides
b) N-terminal or C-terminal ragging: enhancing the information content of the database
c) Non-lossy redundancy clearing: raising the database information ratio
d) Create shuffled and reversed databases: false-positives testing
e) Extract sequence-based subsets: a priori prediction of potential success rate
f) Map peptides back to proteins (maximal annotation approach): find all matching proteins, and select primaries
etc …
Summary of DBToolkit functionalities
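As an illustration of item f), the sketch below maps peptides back to every protein that contains them, using a simple substring search. The function name and the dictionary layout are assumptions for this example; selecting "primary" proteins among the matches is not covered.

```python
def map_peptides_to_proteins(peptides, proteins):
    """Map each peptide to the sorted accessions of all proteins containing it."""
    return {
        peptide: sorted(acc for acc, seq in proteins.items() if peptide in seq)
        for peptide in peptides
    }


proteins = {"Protein 1": "LENNARTMARTENS", "Protein 2": "LENNARTMARTENT"}
mapping = map_peptides_to_proteins(["LENNAR", "TENS"], proteins)
print(mapping)
# {'LENNAR': ['Protein 1', 'Protein 2'], 'TENS': ['Protein 1']}
```

LENNAR matches both proteins, while TENS is unique to Protein 1.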
http://www.ebi.ac.uk/pride/dod
Database on Demand – DBToolkit online
See: Reisinger et al., Proteomics 2009, 9(18): 4421-4424
WHY DOES PROCESSING MATTER?
From: Yi et al., Journal of Proteome Research 2007, 6(5): 1768-1781
Serum degradation over time
From: Yi et al., Journal of Proteome Research 2007, 6(5): 1768-1781
Plasma degradation over time
TIME-LABILITY OF
SEQUENCE DATABASES
Bringing the PPP from IPI 2.21 to IPI 3.13
Count   Status                    % of total   % of parent
1555    Total
1048    Unchanged                 67%
 507    Changed                   33%
 338      Propagated              22%          67% (of ‘Changed’)
 169      Defunct                 11%          33% (of ‘Changed’)
  95        Defunct (RFSQ_XP)      6%          56% (of ‘Defunct’)
  72        Defunct (Ensembl)      5%          43% (of ‘Defunct’)
   2        UniProt                0%           1% (of ‘Defunct’)
(Both exist; 1 taxonomy now: RAT; 1 immunoglobulin)
1048 + 338 = 1386 recoverable (89.1%)
See: Martens and Mueller et al., Proteomics 2006, 6(18):5059-75
Example 1: HUPO PPP actualisation
Bringing the platelets from IPI 2.31 to IPI 3.13
Count   Status                    % of total   % of parent
 673    Total
 578    Unchanged                 86%
  95    Changed                   14%
  78      Propagated              12%          82% (of ‘Changed’)
  17      Defunct                  3%          18% (of ‘Changed’)
   5        Defunct (RFSQ_XP)      1%          29% (of ‘Defunct’)
  12        Defunct (Ensembl)      2%          71% (of ‘Defunct’)
578 + 78 = 656 recoverable (97%)
See: Martens and Mueller et al., Proteomics 2006, 6(18):5059-75
Example 2: human blood platelets
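The "recoverable" figures in both examples are the unchanged plus the propagated identifiers, expressed as a fraction of the total. A small sketch of that arithmetic, using the counts from the two slides (the function name is illustrative):

```python
def recoverable(total, unchanged, propagated):
    """Return (recoverable count, recoverable percentage of the total)."""
    count = unchanged + propagated
    return count, 100.0 * count / total


ppp = recoverable(total=1555, unchanged=1048, propagated=338)
platelets = recoverable(total=673, unchanged=578, propagated=78)

print(ppp[0], round(ppp[1], 1))            # 1386 89.1
print(platelets[0], round(platelets[1]))   # 656 97
```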
Adapted from: http://www.ebi.ac.uk/ipi
Proteins sometimes age badly
THE PICR MAPPING SERVICE
- Submit accessions OR sequences (FASTA) with 500 entry interactive limit (no batch limit)
- Select output format
- Select one or many databases to map to in one request
- Limit search by taxonomy (pessimistic)
- Choose to return all mappings or only active ones
- Run search
http://www.ebi.ac.uk/tools/picr
See: Côté et al., BMC Bioinformatics 2007, 8: 401
Identifiers through (name)space and time
Mapping results
ESTIMATING FALSE DISCOVERY RATES
THE DECOY DATABASE APPROACH
Three main types of decoy DBs are used:
- Reversed databases (easy)
  LENNARTMARTENS -> SNETRAMTRANNEL
- Shuffled databases (slightly more difficult)
  LENNARTMARTENS -> NMERLANATERTTN (for instance)
- Randomized databases (as difficult as you want it to be)
  LENNARTMARTENS -> GFVLAEPHSEAITK (for instance)
The concept is that each peptide identified from the decoy database is an incorrect identification. By counting the number of decoy hits, we can estimate the number of false positives in the original database.
Decoy databases, the latest fashion
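The three decoy types can be sketched in a few lines. This is an illustration only: the function names are made up for the example, and the randomized variant draws residues uniformly from the 20 standard amino acids, which is the simplest possible choice rather than a recommended one.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues


def reversed_decoy(sequence):
    """Reverse the sequence."""
    return sequence[::-1]


def shuffled_decoy(sequence, rng):
    """Permute the residues, keeping the amino acid composition."""
    residues = list(sequence)
    rng.shuffle(residues)
    return "".join(residues)


def randomized_decoy(sequence, rng):
    """Draw a new sequence of the same length (uniform residue frequencies assumed)."""
    return "".join(rng.choice(AMINO_ACIDS) for _ in sequence)


rng = random.Random(2011)  # fixed seed so the decoys are reproducible
target = "LENNARTMARTENS"
print(reversed_decoy(target))  # SNETRAMTRANNEL
print(shuffled_decoy(target, rng))
print(randomized_decoy(target, rng))
```

A more demanding randomized database could, for example, sample residues according to the amino acid frequencies of the original database.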
$$\mathrm{FDR} = \frac{2 \times \mathrm{nbr\_decoy\_hits}}{\mathrm{nbr\_forward\_hits} + \mathrm{nbr\_decoy\_hits}}$$
FDR is the False Discovery Rate – it is a metric that gives you an indication of how many (percent) of your identifications are potentially incorrect. Note that we multiply the number of decoy hits by 2, because we should not only count the actual decoy hits, but also the ‘hidden’ false positives that are present in the forward identifications. The assumption here is that we expect one forward false positive hit per decoy false positive hit, hence the doubling term.
From: Elias and Gygi, Nature Methods 2007, 4(3): 207-214
Estimating the FDR (i)
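As a worked example of the doubled-decoy estimate described on this slide, the sketch below evaluates 2 × decoy hits divided by all hits. The function name and the hit counts are invented for the illustration.

```python
def fdr_doubled_decoys(nbr_decoy_hits, nbr_forward_hits):
    """FDR = 2 * decoy hits / (forward hits + decoy hits)."""
    total_hits = nbr_forward_hits + nbr_decoy_hits
    if total_hits == 0:
        raise ValueError("no hits, FDR is undefined")
    return 2 * nbr_decoy_hits / total_hits


# Hypothetical search result: 990 forward hits and 10 decoy hits
print(fdr_doubled_decoys(nbr_decoy_hits=10, nbr_forward_hits=990))  # 0.02
```

In this made-up case about 2% of all identifications (forward and decoy together) are expected to be incorrect.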
$$\mathrm{FDR} = \frac{\mathrm{nbr\_decoy\_hits}}{\mathrm{nbr\_forward\_hits}}$$
This metric was proposed by Storey and Tibshirani for genomics data, and further investigated by Lukas Käll for proteomics. It provides a more accurate (and simpler!) estimate of the FDR, and can be extended to also take into account the (suspected) false positives in the forward set.
See: Storey and Tibshirani, PNAS 2003, 100(16): 9440-9445
Estimating the FDR (ii)
See: Käll et al., JPR 2008, 7(1): 29-34
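A small sketch of how the simpler ratio of decoy hits to forward hits might be evaluated at different score thresholds. It assumes that forward and decoy identifications each come with a score where higher is better; the function name and all scores are invented for the illustration.

```python
def fdr_at_threshold(forward_scores, decoy_scores, threshold):
    """FDR = decoy hits / forward hits, counting only hits scoring >= threshold."""
    nbr_forward_hits = sum(score >= threshold for score in forward_scores)
    nbr_decoy_hits = sum(score >= threshold for score in decoy_scores)
    if nbr_forward_hits == 0:
        return 0.0  # no forward hits accepted, so nothing to be wrong about
    return nbr_decoy_hits / nbr_forward_hits


# Hypothetical scores
forward = [90, 85, 80, 70, 60, 55, 50, 40]
decoy = [58, 45, 30, 20]

for threshold in (40, 50, 60):
    print(threshold, round(fdr_at_threshold(forward, decoy, threshold), 3))
```

Raising the threshold lowers the estimated FDR at the cost of accepting fewer forward hits.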
Thank you!
Questions?