BITS - Overview of sequence databases for mass spectrometry data analysis

http://www.bits.vib.be/training

description

This is the fourth presentation of the BITS training on 'Mass spec data processing'. It reviews sequence databases and their flaws in light of mass spectrometry data analysis. Thanks to the Compomics Lab of the VIB for their contribution.

Transcript of BITS - Overview of sequence databases for mass spectrometry data analysis

Page 2: BITS - Overview of sequence databases for mass spectrometry data analysis

BITS MS Data Processing – Sequence Databases UGent, Gent, Belgium – 16 December 2011

Lennart Martens [email protected]

sequence databases

lennart martens

[email protected]

Computational Omics and Systems Biology Group

Department of Medical Protein Research, VIB Department of Biochemistry, Ghent University

Ghent, Belgium

Page 3: BITS - Overview of sequence databases for mass spectrometry data analysis

PEPTIDES AND REDUNDANCY

IN SEQUENCE DATABASES

Page 4: BITS - Overview of sequence databases for mass spectrometry data analysis

>Protein 1
LENNARTMARTENS
>Protein 2
LENNARTMARTENT

>Protein 1 (1-6)
LENNAR
>Protein 1 (7-10)
TMAR
>Protein 1 (11-14)
TENS
>Protein 2 (1-6)
LENNAR
>Protein 2 (7-10)
TMAR
>Protein 2 (11-14)
TENT

Peptides (1-6) and (7-10) are identical in both proteins:

non-redundant protein DB ≠ non-redundant peptide DB

Database content: all peptide sequences in the database
Database information: number of unique peptide sequences
Database information ratio: database information / database content

Peptide-level sequence redundancy
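
The information ratio for the two example proteins above can be worked out with a few lines of Python. This is a minimal sketch, not DBToolkit's implementation: `tryptic_digest` is a hypothetical helper that cleaves after every K or R, and it ignores missed cleavages, the proline rule and mass limits.

```python
def tryptic_digest(sequence):
    """Simplified trypsin: cleave after every K or R (assumption: no
    missed cleavages, no proline rule, no mass filtering)."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        if residue in "KR":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])  # C-terminal peptide
    return peptides


proteins = {
    "Protein 1": "LENNARTMARTENS",
    "Protein 2": "LENNARTMARTENT",
}

# Database content: every peptide, duplicates included.
all_peptides = [p for seq in proteins.values() for p in tryptic_digest(seq)]
content = len(all_peptides)

# Database information: unique peptide sequences only.
information = len(set(all_peptides))

ratio = information / content
print(f"content={content}, information={information}, ratio={ratio:.0%}")
```

The two proteins differ in a single residue, so the protein database is non-redundant, yet only 4 of the 6 peptides are unique.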

Page 5: BITS - Overview of sequence databases for mass spectrometry data analysis

[Bar chart: database content and database information (number of peptides), with the resulting information ratio, per database]

Database                     Content       Information   Ratio
UniProtKB/SwissProt human     1,584,806     1,466,927     93%
UniProtKB/TrEMBL human        3,186,806     1,309,625     41%
Ensembl human                 3,491,778     1,559,685     45%
IPI human                     4,472,356     1,877,500     42%
NCBI nr human                10,307,319     2,394,844     23%

Tryptic cleavage, 1 allowed missed cleavage, mass limits from 600 to 4000 Da.

Information ratios for common databases

Page 6: BITS - Overview of sequence databases for mass spectrometry data analysis

ENRICHING SEQUENCE DATABASES

Page 7: BITS - Overview of sequence databases for mass spectrometry data analysis

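[Diagram: a protein (N to C) undergoes in vivo processing, yielding a shortened protein (N to C)]

Enzymatic digest and subsequent NH2-terminal peptide isolation

The resulting NH2-terminal peptide: not in the sequence database!

Search base: ID miss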

The influence of the sequence database

Page 8: BITS - Overview of sequence databases for mass spectrometry data analysis

Mitochondrial Isovaleryl-CoA Dehydrogenase

N-terminal transit peptide (1-29)
Isovaleryl-CoA dehydrogenase (30-423)

MATATRLLGWRVASWRLRPPLAGFVSQRAHSLLPVDDAINGLSEEQRQLRE… …LDGIQCFGGNGYINDFPMGRFLRDAKLYEIGAGTSEVRRLVIGRAFNADFH

Positions marked on the sequence: 30, 47, 423

An example

Page 9: BITS - Overview of sequence databases for mass spectrometry data analysis

Search base: ID miss
Revised search base: ID

AHSLLPVDDAINGLSEEQR (peptide in the search base), extended with its N-terminally truncated forms:

AHSLLPVDDAINGLSEEQR
HSLLPVDDAINGLSEEQR
SLLPVDDAINGLSEEQR
LLPVDDAINGLSEEQR
LPVDDAINGLSEEQR
PVDDAINGLSEEQR
VDDAINGLSEEQR
……

Extending the information content
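
The list of truncated peptides above can be generated mechanically. The following is a minimal sketch, assuming that N-terminal ragging simply removes one residue at a time from the N-terminus; `nterm_ragged` and its `min_length` cut-off are hypothetical names chosen for illustration and are not taken from DBToolkit.

```python
def nterm_ragged(peptide, min_length=6):
    """Return the peptide plus all N-terminally truncated forms that are
    at least `min_length` residues long (hypothetical cut-off)."""
    return [peptide[i:] for i in range(len(peptide) - min_length + 1)]


# Tryptic peptide spanning the transit-peptide boundary in the example.
database_peptide = "AHSLLPVDDAINGLSEEQR"
ragged = nterm_ragged(database_peptide)

for form in ragged:
    # Right-align so the shared C-terminus lines up, as on the slide.
    print(form.rjust(len(database_peptide)))
```

The second entry, HSLLPVDDAINGLSEEQR, is the N-terminal peptide of the mature protein (positions 30-47), which a search against the unmodified database would miss.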

Page 10: BITS - Overview of sequence databases for mass spectrometry data analysis

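Caspase cleavage of this protein (for 50%)

[Diagram: a protein drawn from NH2 to COOH with its R residues and one D residue marked, shown both intact and after in vivo cleavage at D; both forms are then digested at R]

NH2-terminal peptide isolation

The peptide that starts at the in vivo cleavage site: NOT IN DB!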

Another example: in vivo protein cleavage

Page 11: BITS - Overview of sequence databases for mass spectrometry data analysis

[Diagram: the missing peptide (NH2 to COOH, ending in R); its N-terminus is the result of the in vivo protease, its C-terminus the result of the in vitro digest]

Arg-C definition:

Title:Arg-C
Cleavage:R
Restrict:P
Cterm

Arg-C (N-term), Cathepsin (C-term) definition:

Title:dualArgC_Cathep
Cleavage:DXR
Restrict:P
Cterm

Creation of a bifunctional enzyme will generate the correct peptides!

Solving the issue: bifunctional enzymes
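
The idea of a bifunctional enzyme can be illustrated with a short Python sketch. Everything here is an assumption made for illustration: the protein sequence is invented, `cterm_digest` and `dual_digest` are hypothetical helpers rather than DBToolkit functions, and the dual definition is read as "peptide N-terminus created by cleavage after D, peptide C-terminus created by cleavage after R, no cleavage before P".

```python
def cterm_digest(sequence, cleavage="R", restrict="P"):
    """Regular enzyme: cleave C-terminal to each residue in `cleavage`,
    unless the next residue is in `restrict`."""
    peptides, start = [], 0
    for i in range(len(sequence) - 1):
        if sequence[i] in cleavage and sequence[i + 1] not in restrict:
            peptides.append(sequence[start:i + 1])
            start = i + 1
    peptides.append(sequence[start:])  # C-terminal peptide
    return peptides


def dual_digest(sequence, nterm_cleavage="D", cterm_cleavage="R", restrict="P"):
    """Bifunctional enzyme (assumed reading): each peptide starts right
    after a residue in `nterm_cleavage` and ends at the first following
    `cterm_cleavage` site, or at the protein C-terminus."""
    peptides = []
    for i in range(len(sequence) - 1):
        if sequence[i] not in nterm_cleavage:
            continue
        start, end = i + 1, len(sequence)
        for j in range(start, len(sequence) - 1):
            if sequence[j] in cterm_cleavage and sequence[j + 1] not in restrict:
                end = j + 1
                break
        peptides.append(sequence[start:end])
    return peptides


protein = "AARGGDSSLRTTK"  # invented sequence with one D between two R sites

regular = cterm_digest(protein)  # in vitro digest only
dual = dual_digest(protein)      # in vivo cleavage after D + in vitro digest

print("regular:", regular)
print("dual:   ", dual)
```

The peptide that starts after the D residue only appears in the dual digest, which is why a database digested with the regular enzyme alone cannot yield that identification.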

Page 12: BITS - Overview of sequence databases for mass spectrometry data analysis

DBTOOLKIT AND

DATABASE ON DEMAND

Page 13: BITS - Overview of sequence databases for mass spectrometry data analysis

See: Martens et al., Bioinformatics 2005, 21(17): 3584-3585

http://genesis.UGent.be/dbtoolkit

Working with databases: DBToolkit

Page 14: BITS - Overview of sequence databases for mass spectrometry data analysis

a) Enzymatic digestion using regular or ‘dual’ enzymes: proteins to peptides

b) N-terminal or C-terminal ragging: enhancing the information content of the database

c) Non-lossy redundancy clearing: raising the database information ratio

d) Create shuffled and reversed databases: false-positives testing

e) Extract sequence-based subsets: a priori prediction of potential success rate

f) Map peptides back to proteins (maximal annotation approach): find all matching proteins, and select primaries, etc.

Summary of DBToolkit functionalities

Page 15: BITS - Overview of sequence databases for mass spectrometry data analysis

http://www.ebi.ac.uk/pride/dod

Database on Demand – DBToolkit online

See: Reisinger et al., Proteomics 2009, 9(18): 4421-4424

Page 16: BITS - Overview of sequence databases for mass spectrometry data analysis

WHY DOES PROCESSING MATTER?

Page 17: BITS - Overview of sequence databases for mass spectrometry data analysis

From: Yi et al., Journal of Proteome Research 2007, 6(5): 1768-1781

Serum degradation over time

Page 18: BITS - Overview of sequence databases for mass spectrometry data analysis

From: Yi et al., Journal of Proteome Research 2007, 6(5): 1768-1781

Plasma degradation over time

Page 19: BITS - Overview of sequence databases for mass spectrometry data analysis

TIME-LABILITY OF

SEQUENCE DATABASES

Page 20: BITS - Overview of sequence databases for mass spectrometry data analysis

Bringing the PPP from IPI 2.21 to IPI 3.13

1555 Total
1048 Unchanged 67%
507 Changed 33%, of which:
    338 Propagated 22% (67% of ‘Changed’)
    169 Defunct 11% (33% of ‘Changed’), of which:
        95 Defunct (RFSQ_XP) 6% (56% of ‘Defunct’)
        72 Defunct (Ensembl) 5% (43% of ‘Defunct’)
        2 UniProt 0% (1% of ‘Defunct’)

1048 + 338 = 1386 recoverable (89.1%)

Both exist; 1 taxonomy now: RAT, 1 immunoglobulin

See: Martens and Mueller et al., Proteomics 2006, 6(18):5059-75

Example 1: HUPO PPP actualisation

Page 21: BITS - Overview of sequence databases for mass spectrometry data analysis

Bringing the Platelets from IPI 2.31 to IPI 3.13

673 Total
578 Unchanged 86%
95 Changed 14%, of which:
    78 Propagated 12% (82% of ‘Changed’)
    17 Defunct 3% (18% of ‘Changed’), of which:
        5 Defunct (RFSQ_XP) 1% (29% of ‘Defunct’)
        12 Defunct (Ensembl) 2% (71% of ‘Defunct’)

578 + 78 = 656 recoverable (97%)

See: Martens and Mueller et al., Proteomics 2006, 6(18):5059-75

Example 2: human blood platelets
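
The recoverable fractions in both examples follow from the counts on the slides: an identification is recoverable when its accession is either unchanged or propagated to a new accession. The sketch below redoes that arithmetic; the function name `recoverable` is a hypothetical label, and the counts are the ones listed for the HUPO PPP and platelet data sets.

```python
def recoverable(total, unchanged, propagated):
    """Return (count, fraction) of identifications that survive the
    database update: unchanged plus propagated accessions."""
    count = unchanged + propagated
    return count, count / total


# Counts taken from the two slides (IPI 2.21 / 2.31 to IPI 3.13).
ppp_count, ppp_fraction = recoverable(total=1555, unchanged=1048, propagated=338)
plt_count, plt_fraction = recoverable(total=673, unchanged=578, propagated=78)

print(f"HUPO PPP:  {ppp_count} recoverable ({ppp_fraction:.1%})")
print(f"Platelets: {plt_count} recoverable ({plt_fraction:.1%})")
```

The remaining entries are the defunct ones (169 for the PPP, 17 for the platelets), which can no longer be traced in the newer release.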

Page 22: BITS - Overview of sequence databases for mass spectrometry data analysis

Adapted from: http://www.ebi.ac.uk/ipi

Proteins sometimes age badly

Page 23: BITS - Overview of sequence databases for mass spectrometry data analysis

THE PICR MAPPING SERVICE

Page 24: BITS - Overview of sequence databases for mass spectrometry data analysis

- Submit accessions OR sequences (FASTA), with a 500 entry interactive limit (no batch limit)

- Select output format

- Select one or many databases to map to in one request

- Limit search by taxonomy (pessimistic)

- Choose to return all mappings or only active ones

- Run search

http://www.ebi.ac.uk/tools/picr

See: Côté et al., BMC Bioinformatics 2007, 8: 401

Identifiers through (name)space and time

Page 25: BITS - Overview of sequence databases for mass spectrometry data analysis

Mapping results

Page 26: BITS - Overview of sequence databases for mass spectrometry data analysis

ESTIMATING FALSE DISCOVERY RATES

THE DECOY DATABASE APPROACH

Page 27: BITS - Overview of sequence databases for mass spectrometry data analysis

Three main types of decoy DB’s are used:

- Reversed databases (easy)
  LENNARTMARTENS → SNETRAMTRANNEL

- Shuffled databases (slightly more difficult)
  LENNARTMARTENS → NMERLANATERTSN (for instance)

- Randomized databases (as difficult as you want it to be)
  LENNARTMARTENS → GFVLAEPHSEAITK (for instance)

The concept is that each peptide identified from the decoy database is an incorrect identification. By counting the number of decoy hits, we can estimate the number of false positives in the original database.

Decoy databases, the latest fashion
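
The three decoy types can each be produced in a line or two of Python. This is a minimal sketch under stated assumptions: the function names are hypothetical, the randomized variant draws uniformly from the 20 standard amino acids (real tools may instead follow observed residue frequencies), and the seed is only there to make the run repeatable.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def reversed_decoy(sequence):
    """Reversed database entry: read the sequence back to front."""
    return sequence[::-1]


def shuffled_decoy(sequence, rng):
    """Shuffled database entry: same residues, random order."""
    residues = list(sequence)
    rng.shuffle(residues)
    return "".join(residues)


def randomized_decoy(sequence, rng):
    """Randomized database entry: same length, residues drawn uniformly
    at random (assumption: no composition model)."""
    return "".join(rng.choice(AMINO_ACIDS) for _ in sequence)


rng = random.Random(2011)  # fixed seed, repeatable output
target = "LENNARTMARTENS"

decoy_reversed = reversed_decoy(target)
decoy_shuffled = shuffled_decoy(target, rng)
decoy_random = randomized_decoy(target, rng)

print(decoy_reversed)
print(decoy_shuffled)
print(decoy_random)
```

Reversing and shuffling preserve the amino acid composition and length of every protein; the randomized variant only preserves the length.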

Page 28: BITS - Overview of sequence databases for mass spectrometry data analysis

\[
\mathrm{FDR} = \frac{2 \times \mathrm{nbr\_decoy\_hits}}{\mathrm{nbr\_forward\_hits} + \mathrm{nbr\_decoy\_hits}}
\]

FDR is the False Discovery Rate – it is a metric that gives you an indication of how many (percent) of your identifications are potentially incorrect. Note that we multiply the number of decoy hits by 2, because we should not only count the actual decoy hits, but also the ‘hidden’ false positives that are present in the forward identifications. The assumption here is that we expect one forward false positive hit per decoy false positive hit, hence the doubling term.

From: Elias and Gygi, Nature Methods 2007, 4(3): 207-214

Estimating the FDR (i)

Page 29: BITS - Overview of sequence databases for mass spectrometry data analysis

\[
\mathrm{FDR} = \frac{\mathrm{nbr\_decoy\_hits}}{\mathrm{nbr\_forward\_hits}}
\]

This metric was proposed by Storey and Tibshirani for genomics data, and further investigated by Lukas Käll for proteomics. It provides a more accurate (and simpler!) estimate of the FDR, but can be extended to also take into account the (suspected) false positives in the forward set.

See: Storey and Tibshirani, PNAS 2003, 100(16): 9440-9445

Estimating the FDR (ii)

See: Käll et al., JPR 2008, 7(1): 29-34
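
Both FDR estimates are simple ratios of hit counts, which the following sketch makes concrete. The function names and the hit counts (950 forward hits, 50 decoy hits) are hypothetical and chosen only to show how the two estimates differ.

```python
def fdr_doubled_decoys(nbr_decoy_hits, nbr_forward_hits):
    """Estimate (i): 2 x decoy hits over all hits (forward + decoy),
    counting one hidden forward false positive per decoy hit."""
    total = nbr_forward_hits + nbr_decoy_hits
    if total == 0:
        raise ValueError("no hits to estimate an FDR from")
    return 2 * nbr_decoy_hits / total


def fdr_decoy_over_forward(nbr_decoy_hits, nbr_forward_hits):
    """Estimate (ii): decoy hits over forward hits."""
    if nbr_forward_hits == 0:
        raise ValueError("no forward hits to estimate an FDR from")
    return nbr_decoy_hits / nbr_forward_hits


# Hypothetical search result: 950 forward hits and 50 decoy hits.
fdr_i = fdr_doubled_decoys(nbr_decoy_hits=50, nbr_forward_hits=950)
fdr_ii = fdr_decoy_over_forward(nbr_decoy_hits=50, nbr_forward_hits=950)

print(f"FDR (i):  {fdr_i:.1%}")
print(f"FDR (ii): {fdr_ii:.1%}")
```

With these counts, estimate (i) is 100/1000 and estimate (ii) is 50/950, so the doubled-decoy formula reports roughly twice the rate of the simpler ratio.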

Page 30: BITS - Overview of sequence databases for mass spectrometry data analysis

Thank you!

Questions?