BITS - Overview of sequence databases for mass spectrometry data analysis

Post on 07-Dec-2014


This is the fourth presentation of the BITS training on 'Mass spec data processing'. It reviews sequence databases and their flaws in light of mass spectrometry data analysis. Thanks to the Compomics Lab of the VIB for their contribution.

Transcript of BITS - Overview of sequence databases for mass spectrometry data analysis

BITS MS Data Processing – Sequence Databases UGent, Gent, Belgium – 16 December 2011

Lennart Martens lennart.martens@ugent.be

sequence databases

lennart martens

lennart.martens@ugent.be

Computational Omics and Systems Biology Group

Department of Medical Protein Research, VIB
Department of Biochemistry, Ghent University

Ghent, Belgium


PEPTIDES AND REDUNDANCY IN SEQUENCE DATABASES


>Protein 1
LENNARTMARTENS
>Protein 2
LENNARTMARTENT

>Protein 1 (1-6)
LENNAR
>Protein 1 (7-10)
TMAR
>Protein 1 (11-14)
TENS
>Protein 2 (1-6)
LENNAR
>Protein 2 (7-10)
TMAR
>Protein 2 (11-14)
TENT

non-redundant protein DB ≠ non-redundant peptide DB

(Protein 1 and Protein 2 differ, yet their peptides LENNAR and TMAR are identical.)

Database content: all peptide sequences in the database

Database information: number of unique peptide sequences

Database information ratio:

$$\text{database information ratio} = \frac{\text{database information}}{\text{database content}}$$

Peptide-level sequence redundancy
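As a rough illustration of the definitions above, the sketch below digests the two toy proteins and computes their information ratio. The function name `tryptic_digest` and the simple cleavage rule (after K or R, not before P, no missed cleavages) are assumptions made for this example; they are not taken from the slides.

```python
def tryptic_digest(sequence):
    """Cleave after K or R unless the next residue is P (no missed cleavages)."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        # Remaining C-terminal peptide that does not end in K or R.
        peptides.append(sequence[start:])
    return peptides


proteins = {"Protein 1": "LENNARTMARTENS", "Protein 2": "LENNARTMARTENT"}

# Database content: every peptide, duplicates included.
all_peptides = [pep for seq in proteins.values() for pep in tryptic_digest(seq)]
# Database information: unique peptide sequences only.
unique_peptides = set(all_peptides)

content = len(all_peptides)
information = len(unique_peptides)
ratio = information / content

print(f"content={content}, information={information}, ratio={ratio:.2f}")
```

For these two proteins, six peptides are produced but only four are unique, so a protein-level non-redundant database still carries peptide-level redundancy.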


[Bar chart: peptide content, peptide information and information ratio per database]

Database                      Content       Information   Ratio
UniProtKB/Swiss-Prot human     1,584,806     1,466,927     93%
UniProtKB/TrEMBL human         3,186,806     1,309,625     41%
Ensembl human                  3,491,778     1,559,685     45%
IPI human                      4,472,356     1,877,500     42%
NCBI nr human                 10,307,319     2,394,844     23%

Tryptic cleavage, 1 allowed missed cleavage, mass limits from 600 to 4000 Da.

Information ratios for common databases


ENRICHING SEQUENCE DATABASES


In vivo processing

[Diagram: a protein drawn from N to C terminus, before and after in vivo processing]

Enzymatic digest and subsequent NH2-terminal peptide isolation

Not in the sequence database!

Search base → ID miss

The influence of the sequence database


Mitochondrial isovaleryl-CoA dehydrogenase

N-terminal transit peptide (1-29), followed by isovaleryl-CoA dehydrogenase (30-423):

MATATRLLGWRVASWRLRPPLAGFVSQRAHSLLPVDDAINGLSEEQRQLRE…
…LDGIQCFGGNGYINDFPMGRFLRDAKLYEIGAGTSEVRRLVIGRAFNADFH

Marked positions: 30 and 47 (residues 30-47, HSLLPVDDAINGLSEEQR) and 423 (C-terminus)

An example


Search base → ID miss

Revised search base → ID

AHSLLPVDDAINGLSEEQR
AHSLLPVDDAINGLSEEQR
HSLLPVDDAINGLSEEQR
SLLPVDDAINGLSEEQR
LLPVDDAINGLSEEQR
LPVDDAINGLSEEQR
PVDDAINGLSEEQR
VDDAINGLSEEQR
……

Extending the information content
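The ladder of truncated peptides shown on this slide can be generated with a few lines of code. The sketch below is only an illustration of N-terminal ragging; the function name `nterm_ragging` and the `min_length` cut-off of 6 residues are hypothetical choices, not DBToolkit settings.

```python
def nterm_ragging(peptide, min_length=6):
    """Return the peptide and all its N-terminally truncated forms.

    min_length is a hypothetical cut-off below which truncated
    forms are no longer generated.
    """
    return [peptide[i:] for i in range(len(peptide) - min_length + 1)]


# Tryptic peptide present in the original search base.
search_base = {"AHSLLPVDDAINGLSEEQR"}
# N-terminal peptide of the mature protein (residues 30-47).
observed = "HSLLPVDDAINGLSEEQR"

# Revised search base: the original peptide plus its ragged forms.
revised_search_base = set()
for peptide in search_base:
    revised_search_base.update(nterm_ragging(peptide))

print("in original search base:", observed in search_base)
print("in revised search base: ", observed in revised_search_base)
```

The price of ragging is a larger search space, which is why a minimum length (or a restriction to protein N-termini) is usually worth considering.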


Caspase cleavage of this protein (for 50%)

[Diagram: the intact protein (NH2 … R R D R R … COOH) and the fragments that remain after cleavage and digestion]

NH2-terminal peptide isolation

NOT IN DB!

Another example: in vivo protein cleavage


[Diagram: a peptide (NH2 … R COOH) with one end labelled 'result of in vivo protease' and the other 'result of in vitro trypsin']

Arg-C definition:

Title:Arg-C
Cleavage:R
Restrict:P
Cterm

Arg-C (N-term), Cathepsin (C-term) definition:

Title:dualArgC_Cathep
Cleavage:DXR
Restrict:P
Cterm

Creation of a bifunctional enzyme will generate the correct peptides!

Solving the issue: bifunctional enzymes
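To make the idea of a bifunctional ('dual') enzyme concrete, here is a minimal sketch under stated assumptions. It is not DBToolkit's implementation and it does not parse the enzyme definitions shown above. The helper names, the toy sequence, and the choice of D as the in vivo site and R as the in vitro site are all hypothetical. The assumed rule is that a peptide starts where the first enzyme cleaved (or at the protein N-terminus) and ends at the next site of the second enzyme.

```python
def cleaves_after(sequence, i, sites, restrict="P"):
    """True if the bond C-terminal of residue i is cleaved: residue i is a
    cleavage site and the next residue (if any) is not the restricting one."""
    if sequence[i] not in sites:
        return False
    return i + 1 >= len(sequence) or sequence[i + 1] not in restrict


def digest(sequence, sites):
    """Regular digest with a single enzyme."""
    peptides, start = [], 0
    for i in range(len(sequence)):
        if cleaves_after(sequence, i, sites):
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides


def dual_digest(sequence, nterm_sites, cterm_sites):
    """Peptides whose N-terminus comes from one enzyme and whose
    C-terminus comes from another (assumed semantics, see lead-in)."""
    starts = [0] + [i + 1 for i in range(len(sequence) - 1)
                    if cleaves_after(sequence, i, nterm_sites)]
    peptides = []
    for start in starts:
        for end in range(start, len(sequence)):
            if cleaves_after(sequence, end, cterm_sites):
                peptides.append(sequence[start:end + 1])
                break
        else:
            # No C-terminal site found: the peptide runs to the protein end.
            peptides.append(sequence[start:])
    return peptides


protein = "MKAARGGDSSSRTTTR"  # made-up toy sequence
print(digest(protein, "R"))            # regular digest at R
print(dual_digest(protein, "D", "R"))  # N-term after D, C-term at R
```

In this toy case the peptide SSSR, which starts right after the D, only appears in the dual digest; a regular digest at R yields GGDSSSR instead and would miss it.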


DBTOOLKIT AND DATABASE ON DEMAND


See: Martens et al., Bioinformatics 2005, 21(17): 3584-3585

http://genesis.UGent.be/dbtoolkit

Working with databases: DBToolkit


a) Enzymatic digestion using regular or ‘dual’ enzymes: proteins to peptides

b) N-terminal or C-terminal ragging: enhancing the information content of the database

c) Non-lossy redundancy clearing: raising the database information ratio

d) Create shuffled and reversed databases: false-positives testing

e) Extract sequence-based subsets: a priori prediction of potential success rate

f) Map peptides back to proteins (maximal annotation approach): find all matching proteins, and select primaries, etc.

Summary of DBToolkit functionalities
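Two of the functions listed above, non-lossy redundancy clearing (c) and mapping peptides back to proteins (f), can be sketched in a few lines. This is an illustration only and makes no claim about how DBToolkit works internally. The function names and the accessions P1 to P3 are hypothetical, and 'non-lossy' is read here as 'identical sequences are merged while every accession is kept'.

```python
from collections import defaultdict


def clear_redundancy(entries):
    """Merge entries with identical sequences and keep all their
    accessions, so no annotation is lost (assumed meaning of 'non-lossy')."""
    accessions_by_sequence = defaultdict(list)
    for accession, sequence in entries:
        accessions_by_sequence[sequence].append(accession)
    return [("|".join(accessions), sequence)
            for sequence, accessions in accessions_by_sequence.items()]


def map_peptide(peptide, entries):
    """Return the accessions of all proteins that contain the peptide."""
    return [accession for accession, sequence in entries if peptide in sequence]


# Hypothetical accessions; P1 and P3 share the same sequence.
entries = [
    ("P1", "LENNARTMARTENS"),
    ("P2", "LENNARTMARTENT"),
    ("P3", "LENNARTMARTENS"),
]

print(clear_redundancy(entries))
print(map_peptide("TENS", entries))
print(map_peptide("LENNAR", entries))
```

Listing every matching protein first, and only then selecting a primary one, keeps the choice of 'primary' explicit rather than an accident of database order.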


http://www.ebi.ac.uk/pride/dod

Database on Demand – DBToolkit online

See: Reisinger et al., Proteomics 2009, 9(18): 4421-4424


WHY DOES PROCESSING MATTER?


From: Yi et al., Journal of Proteome Research 2007, 6(5): 1768-1781

Serum degradation over time


From: Yi et al., Journal of Proteome Research 2007, 6(5): 1768-1781

Plasma degradation over time


TIME-LABILITY OF SEQUENCE DATABASES


Bringing the PPP from IPI 2.21 to IPI 3.13

1555  Total
1048  Unchanged            67%
 507  Changed              33%
      Of which:
 338  Propagated           22%   67% (of ‘Changed’)
 169  Defunct              11%   33% (of ‘Changed’)
      Of which:
  95  Defunct (RFSQ_XP)     6%   56% (of ‘Defunct’)
  72  Defunct (Ensembl)     5%   43% (of ‘Defunct’)
   2  UniProt               0%    1% (of ‘Defunct’)

1048 + 338 = 1386 recoverable (89.1%)

Both exist: 1 taxonomy now RAT, 1 immunoglobulin

See: Martens and Mueller et al., Proteomics 2006, 6(18):5059-75

Example 1: HUPO PPP actualisation


Bringing the platelets from IPI 2.31 to IPI 3.13

 673  Total
 578  Unchanged            86%
  95  Changed              14%
      Of which:
  78  Propagated           12%   82% (of ‘Changed’)
  17  Defunct               3%   18% (of ‘Changed’)
      Of which:
   5  Defunct (RFSQ_XP)     1%   29% (of ‘Defunct’)
  12  Defunct (Ensembl)     2%   71% (of ‘Defunct’)

578 + 78 = 656 recoverable (97%)

See: Martens and Mueller et al., Proteomics 2006, 6(18):5059-75

Example 2: human blood platelets
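The 'recoverable' figures in both examples are the unchanged plus the propagated identifications, expressed as a fraction of the total. The short sketch below reproduces that arithmetic from the counts on the slides; the function name `recoverable` is a hypothetical helper for this illustration.

```python
def recoverable(total, unchanged, propagated):
    """Return (count, percentage) of identifications that survive the
    move to the newer database release."""
    count = unchanged + propagated
    return count, 100.0 * count / total


# Counts taken from the two examples (HUPO PPP and blood platelets).
ppp_count, ppp_percent = recoverable(total=1555, unchanged=1048, propagated=338)
plt_count, plt_percent = recoverable(total=673, unchanged=578, propagated=78)

print(f"HUPO PPP:  {ppp_count} recoverable ({ppp_percent:.1f}%)")
print(f"Platelets: {plt_count} recoverable ({plt_percent:.1f}%)")
```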


Adapted from: http://www.ebi.ac.uk/ipi

Proteins sometimes age badly


THE PICR MAPPING SERVICE


- Submit accessions OR sequences (FASTA), with a 500-entry interactive limit (no batch limit)
- Select output format
- Select one or many databases to map to in one request
- Limit search by taxonomy (pessimistic)
- Choose to return all mappings or only active ones
- Run search

http://www.ebi.ac.uk/tools/picr

See: Côté et al., BMC Bioinformatics 2007, 8: 401

Identifiers through (name)space and time


Mapping results


ESTIMATING FALSE DISCOVERY RATES: THE DECOY DATABASE APPROACH


Three main types of decoy DBs are used:

- Reversed databases (easy)
  LENNARTMARTENS → SNETRAMTRANNEL
- Shuffled databases (slightly more difficult)
  LENNARTMARTENS → NMERLANATERTTN (for instance)
- Randomized databases (as difficult as you want it to be)
  LENNARTMARTENS → GFVLAEPHSEAITK (for instance)

The concept is that each peptide identified from the decoy database is an incorrect identification. By counting the number of decoy hits, we can estimate the number of false positives in the original database.

Decoy databases, the latest fashion
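The three decoy types can be sketched as follows. This is a minimal illustration, not the procedure of any particular tool: the function names are hypothetical, and the randomized variant simply draws from the 20 standard amino acids with equal probability, which is an assumption (real generators often follow the amino acid frequencies of the target database).

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def reversed_decoy(sequence):
    """Read the sequence backwards."""
    return sequence[::-1]


def shuffled_decoy(sequence, rng):
    """Permute the residues: composition is kept, order is lost."""
    residues = list(sequence)
    rng.shuffle(residues)
    return "".join(residues)


def randomized_decoy(sequence, rng):
    """Draw a new sequence of the same length (uniform frequencies assumed)."""
    return "".join(rng.choice(AMINO_ACIDS) for _ in sequence)


rng = random.Random(42)  # fixed seed so a decoy database can be regenerated
target = "LENNARTMARTENS"

print("reversed:  ", reversed_decoy(target))
print("shuffled:  ", shuffled_decoy(target, rng))
print("randomized:", randomized_decoy(target, rng))
```

Reversal is deterministic and needs no seed, which is part of why it is the easy option; the shuffled and randomized variants should record their seed if the search has to be repeated.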


$$\mathrm{FDR} = \frac{2 \times \mathrm{nbr\_decoy\_hits}}{\mathrm{nbr\_forward\_hits} + \mathrm{nbr\_decoy\_hits}}$$

FDR is the False Discovery Rate – it is a metric that gives you an indication of how many (percent) of your identifications are potentially incorrect. Note that we multiply the number of decoy hits by 2, because we should not only count the actual decoy hits, but also the ‘hidden’ false positives that are present in the forward identifications. The assumption here is that we expect one forward false positive hit per decoy false positive hit, hence the doubling term.

From: Elias and Gygi, Nature Methods 2007, 4(3): 207-214

Estimating the FDR (i)
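As a small worked example of the formula on this slide, the sketch below computes the FDR with the doubling term. The function name and the hit counts (50 decoy, 950 forward) are invented for illustration only.

```python
def fdr_doubled(nbr_decoy_hits, nbr_forward_hits):
    """FDR with the doubling term: each decoy hit is assumed to be matched
    by one hidden false positive among the forward hits."""
    total_hits = nbr_forward_hits + nbr_decoy_hits
    if total_hits == 0:
        return 0.0  # no identifications at all, nothing to estimate
    return 2 * nbr_decoy_hits / total_hits


# Invented counts: 50 decoy hits and 950 forward hits.
fdr = fdr_doubled(nbr_decoy_hits=50, nbr_forward_hits=950)
print(f"FDR = {fdr:.1%}")
```

With these counts, 2 × 50 = 100 of the 1000 identifications are estimated to be incorrect, that is, 10%.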


$$\mathrm{FDR} = \frac{\mathrm{nbr\_decoy\_hits}}{\mathrm{nbr\_forward\_hits}}$$

This metric was proposed by Storey and Tibshirani for genomics data, and further investigated by Lukas Käll for proteomics. It provides a more accurate (and simpler!) estimate of the FDR, but can be extended to also take into account the (suspected) false positives in the forward set.

See: Storey and Tibshirani, PNAS 2003, 100(16): 9440-9445

Estimating the FDR (ii)

See: Käll et al., JPR 2008, 7(1): 29-34
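The simpler ratio is easy to apply at every score threshold of a ranked hit list, which is how it is typically used to pick a cut-off. The sketch below assumes each hit is a `(score, is_decoy)` pair with distinct scores, higher being better; that data layout, the function names, and the example scores are all hypothetical.

```python
def fdr_simple(nbr_decoy_hits, nbr_forward_hits):
    """FDR as the ratio of decoy hits to forward hits."""
    if nbr_forward_hits == 0:
        return 0.0  # guard against division by zero (a choice made here)
    return nbr_decoy_hits / nbr_forward_hits


def fdr_by_threshold(hits):
    """For each score, the FDR when all hits scoring at least that much
    are accepted. hits: iterable of (score, is_decoy) with distinct scores."""
    decoys = forwards = 0
    curve = []
    for score, is_decoy in sorted(hits, key=lambda hit: hit[0], reverse=True):
        if is_decoy:
            decoys += 1
        else:
            forwards += 1
        curve.append((score, fdr_simple(decoys, forwards)))
    return curve


# Invented hit list: one decoy hit among four forward hits.
hits = [(90, False), (80, False), (70, True), (60, False), (50, False)]
for score, fdr in fdr_by_threshold(hits):
    print(f"score >= {score}: FDR = {fdr:.2f}")
```

Note that the estimate is not monotonic in the threshold: it jumps when a decoy hit is admitted and then falls again as further forward hits are accepted.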


Thank you!

Questions?