BITS - Overview of sequence databases for mass spectrometry data analysis

http://www.bits.vib.be/training


http://creativecommons.org/licenses/by-sa/3.0/

http://www.bits.vib.be/

BITS MS Data Processing – Sequence Databases UGent, Gent, Belgium – 16 December 2011

Lennart Martens

sequence databases

lennart martens


Computational Omics and Systems Biology Group

Department of Medical Protein Research, VIB; Department of Biochemistry, Ghent University

Ghent, Belgium



PEPTIDES AND REDUNDANCY

IN SEQUENCE DATABASES



>Protein 1
LENNARTMARTENS
>Protein 2
LENNARTMARTENT

>Protein 1 (1-6)
LENNAR
>Protein 1 (7-10)
TMAR
>Protein 1 (11-14)
TENS
>Protein 2 (1-6)
LENNAR
>Protein 2 (7-10)
TMAR
>Protein 2 (11-14)
TENT

non-redundant protein DB ≠ non-redundant peptide DB (the peptides LENNAR and TMAR occur in both proteins)

Database content: all peptide sequences in the database

Database information: number of unique peptide sequences

Database information ratio: database information / database content

Peptide-level sequence redundancy
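
As a minimal sketch, the information ratio for the two toy proteins above can be computed as follows. The function name `tryptic_peptides` is hypothetical, and the digest is simplified: it cleaves after every K or R, with no proline rule, no missed cleavages and no mass limits.

```python
def tryptic_peptides(sequence):
    """Cleave after every K or R (simplified: no proline rule, no missed cleavages)."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        if residue in "KR":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        # Keep the C-terminal remainder that does not end in K or R
        peptides.append(sequence[start:])
    return peptides


proteins = {"Protein 1": "LENNARTMARTENS", "Protein 2": "LENNARTMARTENT"}

# Database content: all peptide sequences, duplicates included
content = [pep for seq in proteins.values() for pep in tryptic_peptides(seq)]
# Database information: unique peptide sequences only
information = set(content)
ratio = len(information) / len(content)

print(len(content), len(information), round(ratio, 2))
```

Here the two proteins differ in their sequence, yet four of the six peptides are copies of LENNAR and TMAR, which is what lowers the ratio.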



Database                      Content      Information   Ratio
UniProtKB/SwissProt human      1,584,806    1,466,927     93%
UniProtKB/TrEMBL human         3,186,806    1,309,625     41%
Ensembl human                  3,491,778    1,559,685     45%
IPI human                      4,472,356    1,877,500     42%
NCBI nr human                 10,307,319    2,394,844     23%

Tryptic cleavage, 1 allowed missed cleavage, mass limits from 600 to 4000 Da.

Information ratios for common databases



ENRICHING SEQUENCE DATABASES



[Figure: a protein (N to C) undergoes in vivo processing, followed by enzymatic digest and subsequent NH2-terminal peptide isolation. The isolated peptide is not in the sequence database, so the search base yields an ID miss.]

The influence of the sequence database



Mitochondrial isovaleryl-CoA dehydrogenase

N-terminal transit peptide (1-29); isovaleryl-CoA dehydrogenase (30-423)

MATATRLLGWRVASWRLRPPLAGFVSQRAHSLLPVDDAINGLSEEQRQLRE…
…LDGIQCFGGNGYINDFPMGRFLRDAKLYEIGAGTSEVRRLVIGRAFNADFH

Positions marked on the sequence: 30, 47 and 423.

An example



Search base: AHSLLPVDDAINGLSEEQR → ID miss

Revised search base → ID:

AHSLLPVDDAINGLSEEQR
HSLLPVDDAINGLSEEQR
SLLPVDDAINGLSEEQR
LLPVDDAINGLSEEQR
LPVDDAINGLSEEQR
PVDDAINGLSEEQR
VDDAINGLSEEQR
……

Extending the information content
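
The ladder of truncated peptides in the revised search base can be generated with a few lines of code. This is only a sketch of the idea; the function name `n_terminal_ragging` and the `min_length` parameter are assumptions made for illustration, not DBToolkit's actual interface.

```python
def n_terminal_ragging(peptide, min_length=1):
    """Return the peptide followed by each N-terminally truncated variant.

    min_length (hypothetical parameter) is the shortest variant to keep.
    """
    return [peptide[i:] for i in range(len(peptide) - min_length + 1)]


# Tryptic peptide from the database entry, ragged down to 13 residues
ladder = n_terminal_ragging("AHSLLPVDDAINGLSEEQR", min_length=13)
for variant in ladder:
    print(variant)
```

The second entry of the ladder, HSLLPVDDAINGLSEEQR, starts at position 30, so the N-terminal peptide of the processed protein becomes identifiable.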



[Figure: a protein (NH2 to COOH) with its R residues and one D residue marked. Caspase cleavage of this protein (for 50%) occurs at the D residue. After digestion and NH2-terminal peptide isolation, the peptide that starts at the caspase cleavage site is NOT IN DB!]

Another example: in vivo protein cleavage



[Figure: a peptide (NH2 to COOH, ending in R) whose one terminus is the result of the in vivo protease and whose other terminus is the result of in vitro trypsin.]

Arg-C definition:

Title:Arg-C
Cleavage:R
Restrict:P
Cterm

Arg-C (N-term), Cathepsin (C-term) definition:

Title:dualArgC_Cathep
Cleavage:DXR
Restrict:P
Cterm

Creation of a bifunctional enzyme will generate the correct peptides!

Solving the issue: bifunctional enzymes
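
The idea behind a bifunctional enzyme can be sketched as follows. This is one interpretation for illustration only, not DBToolkit's implementation: the function names, the toy sequence, and the choice of D (in vivo cleavage) and R (in vitro cleavage) as cleavage residues are all assumptions.

```python
def cleave_after(sequence, residues):
    """Regular digest: cleave after each of the given residues."""
    peptides, start = [], 0
    for i, aa in enumerate(sequence):
        if aa in residues:
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides


def dual_digest(sequence, nterm_after="D", cterm_at="R"):
    """Peptides that start right after an nterm_after residue and run up to
    and including the next cterm_at residue (or the protein end)."""
    peptides = []
    for i, aa in enumerate(sequence):
        if aa in nterm_after and i + 1 < len(sequence):
            start = i + 1
            end = next(
                (j + 1 for j in range(start, len(sequence)) if sequence[j] in cterm_at),
                len(sequence),
            )
            peptides.append(sequence[start:end])
    return peptides


toy_protein = "AARGGDLLRKK"  # hypothetical sequence
regular = cleave_after(toy_protein, "R")
dual = dual_digest(toy_protein)
print(regular)
print(dual)
```

In this toy case the dual digest yields LLR, which the regular digest after R alone never produces.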



DBTOOLKIT AND

DATABASE ON DEMAND



See: Martens et al., Bioinformatics 2005, 21(17): 3584-3585

http://genesis.UGent.be/dbtoolkit

Working with databases: DBToolkit



a) Enzymatic digestion using regular or ‘dual’ enzymes: proteins to peptides

b) N-terminal or C-terminal ragging: enhancing the information content of the database

c) Non-lossy redundancy clearing: raising the database information ratio

d) Create shuffled and reversed databases: false-positives testing

e) Extract sequence-based subsets: a priori prediction of potential success rate

f) Map peptides back to proteins (maximal annotation approach): find all matching proteins, select primaries, etc.

Summary of DBToolkit functionalities



http://www.ebi.ac.uk/pride/dod

Database on Demand – DBToolkit online

See: Reisinger et al., Proteomics 2009, 9(18): 4421-4424



WHY DOES PROCESSING MATTER?



From: Yi et al., Journal of Proteome Research 2007, 6(5): 1768-1781

Serum degradation over time



From: Yi et al., Journal of Proteome Research 2007, 6(5): 1768-1781

Plasma degradation over time



TIME-LABILITY OF

SEQUENCE DATABASES



Bringing the PPP from IPI 2.21 to IPI 3.13

1555 Total
  1048 Unchanged  67%
  507 Changed  33%, of which:
    338 Propagated  22%  67% (of ‘Changed’)
    169 Defunct  11%  33% (of ‘Changed’), of which:
      95 Defunct (RFSQ_XP)  6%  56% (of ‘Defunct’)
      72 Defunct (Ensembl)  5%  43% (of ‘Defunct’)
      2 UniProt  0%  1% (of ‘Defunct’); both exist, 1 taxonomy now: RAT, 1 immunoglobulin

1048 + 338 = 1386 recoverable (89.1%)

See: Martens and Mueller et al., Proteomics 2006, 6(18):5059-75

Example 1: HUPO PPP actualisation



Bringing the platelets from IPI 2.31 to IPI 3.13

673 Total
  578 Unchanged  86%
  95 Changed  14%, of which:
    78 Propagated  12%  82% (of ‘Changed’)
    17 Defunct  3%  18% (of ‘Changed’), of which:
      5 Defunct (RFSQ_XP)  1%  29% (of ‘Defunct’)
      12 Defunct (Ensembl)  2%  71% (of ‘Defunct’)

578 + 78 = 656 recoverable (97%)

See: Martens and Mueller et al., Proteomics 2006, 6(18):5059-75

Example 2: human blood platelets
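
The recoverable counts in both examples are the unchanged plus the propagated entries. A small worked check, using a hypothetical helper name, could look like this:

```python
def recoverable_fraction(total, unchanged, propagated):
    """Recoverable entries are those that are unchanged or propagated."""
    recoverable = unchanged + propagated
    return recoverable, recoverable / total


# Counts taken from the two examples above
ppp = recoverable_fraction(total=1555, unchanged=1048, propagated=338)
platelets = recoverable_fraction(total=673, unchanged=578, propagated=78)

for name, (count, fraction) in {"HUPO PPP": ppp, "platelets": platelets}.items():
    print(f"{name}: {count} recoverable ({100 * fraction:.1f}%)")
```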



Adapted from: http://www.ebi.ac.uk/ipi

Proteins sometimes age badly



THE PICR MAPPING SERVICE



- Submit accessions OR sequences (FASTA), with a 500-entry interactive limit (no batch limit)

- Select output format

- Select one or many databases to map to in one request

- Limit search by taxonomy (pessimistic)

- Choose to return all mappings or only active ones

- Run search

http://www.ebi.ac.uk/tools/picr

See: Côté et al., BMC Bioinformatics 2007, 8: 401

Identifiers through (name)space and time



Mapping results



ESTIMATING FALSE DISCOVERY RATES

THE DECOY DATABASE APPROACH



Three main types of decoy DBs are used:

- Reversed databases (easy)
  LENNARTMARTENS → SNETRAMTRANNEL

- Shuffled databases (slightly more difficult)
  LENNARTMARTENS → NMERLANATERTSN (for instance)

- Randomized databases (as difficult as you want it to be)
  LENNARTMARTENS → GFVLAEPHSEAITK (for instance)

The concept is that each peptide identified from the decoy database is an incorrect identification. By counting the number of decoy hits, we can estimate the number of false positives in the original database.

Decoy databases, the latest fashion
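
A minimal sketch of the three decoy types for a single sequence is shown below. The function names, the fixed `seed` values and the uniform 20-residue alphabet are assumptions for illustration. Real randomized databases often preserve amino acid frequencies, which this sketch does not.

```python
import random


def reversed_decoy(sequence):
    """Reverse the sequence."""
    return sequence[::-1]


def shuffled_decoy(sequence, seed=0):
    """Permute the residues; composition is preserved, order is not."""
    residues = list(sequence)
    random.Random(seed).shuffle(residues)
    return "".join(residues)


def randomized_decoy(sequence, seed=0, alphabet="ACDEFGHIKLMNPQRSTVWY"):
    """Draw a new sequence of the same length, uniformly from the alphabet."""
    rng = random.Random(seed)
    return "".join(rng.choice(alphabet) for _ in sequence)


target = "LENNARTMARTENS"
print(reversed_decoy(target))
print(shuffled_decoy(target))
print(randomized_decoy(target))
```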



$$\mathrm{FDR} = \frac{2 \times \mathrm{nbr\_decoy\_hits}}{\mathrm{nbr\_forward\_hits} + \mathrm{nbr\_decoy\_hits}}$$

FDR is the False Discovery Rate – it is a metric that gives you an indication of how many (percent) of your identifications are potentially incorrect. Note that we multiply the number of decoy hits by 2, because we should not only count the actual decoy hits, but also the ‘hidden’ false positives that are present in the forward identifications. The assumption here is that we expect one forward false positive hit per decoy false positive hit, hence the doubling term.

From: Elias and Gygi, Nature Methods 2007, 4(3): 207-214

Estimating the FDR (i)
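
As a worked example of this estimate, the sketch below applies the doubling formula to made-up hit counts. The function name and the counts are hypothetical.

```python
def fdr_doubled(nbr_decoy_hits, nbr_forward_hits):
    """FDR estimate that doubles the decoy hits to account for the
    'hidden' false positives among the forward hits."""
    return 2 * nbr_decoy_hits / (nbr_forward_hits + nbr_decoy_hits)


# Hypothetical search result: 1000 forward hits and 10 decoy hits
estimate = fdr_doubled(nbr_decoy_hits=10, nbr_forward_hits=1000)
print(f"{100 * estimate:.2f}%")
```

With these counts the estimate is 20 / 1010, or roughly 2%.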



$$\mathrm{FDR} = \frac{\mathrm{nbr\_decoy\_hits}}{\mathrm{nbr\_forward\_hits}}$$

This metric was proposed by Storey and Tibshirani for genomics data, and further investigated by Lukas Käll for proteomics. It provides a more accurate (and simpler!) estimate of the FDR, and can be extended to also take into account the (suspected) false positives in the forward set.

See: Storey and Tibshirani, PNAS 2003, 100(16): 9440-9445

Estimating the FDR (ii)

See: Käll et al., JPR 2008, 7(1): 29-34
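
The simpler ratio can be sketched in the same way. Again the function name and the hit counts are hypothetical, and the extension that accounts for false positives in the forward set is not shown.

```python
def fdr_simple(nbr_decoy_hits, nbr_forward_hits):
    """FDR estimate as the ratio of decoy hits to forward hits."""
    return nbr_decoy_hits / nbr_forward_hits


# Hypothetical search result: 1000 forward hits and 10 decoy hits
simple_estimate = fdr_simple(nbr_decoy_hits=10, nbr_forward_hits=1000)
print(f"{100 * simple_estimate:.2f}%")
```

For 10 decoy hits against 1000 forward hits this gives 1%.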



Thank you!

Questions?
