
9/6/2016

Current state of proteomics standardization and (C-)HPP data quality guidelines

DTL focus meeting on data integration, standards and fair principles in proteomics

Department of Pharmacy, Analytical Biochemistry

Péter Horvatovich


Organization of the Human Proteome Project


Organization of C-HPP I.

http://c-hpp.webhosting.rug.nl/tiki-index.php

Integration of C-HPP and B/D-HPP Teams

Biobanks

Organization of C-HPP II.

Slide from Mark Baker

First guideline of C-HPP

Paik YK, et al., Standard Guidelines for the Chromosome-Centric Human Proteome Project, PMID 22443261.

The key to making real headway on the HPP is to agree on a common, shared, globally acceptable “big data” language

Slide from Mark Baker

The Human Proteome Project Workflow

[Figure: workflow diagram with the labels Individual lab-based MS data, ProteomeXchange, PRIDE, MassIVE, GPMdb, PASSEL, PeptideAtlas, neXtProt, HPP Metrics, Human Protein Atlas, HPP Publications, HPP Guidelines. Footnote: † neXtProt PE1-5 classifications.]

Slide from Mark Baker

Slide from Lydie Lane

PE Level                                  neXtProt 18/09/2013 version   %      neXtProt 12/02/2016 version   %
PE1  Evidence at Protein Level            15,649                        77.7   16,518                        82.4
PE2  Evidence at Transcript Level only    3,576                         17.7   2,290                         11.4
PE3  Inferred from Homology               198                           1.0    565                           2.8
PE4  Predicted                            94                            0.5    94                            0.5
PE5  Uncertain                            635                           3.2    588                           2.9
TOTAL                                     20,152                        100    20,055                        100

HPP/neXtProt protein existence data from 2013-2016

the missing proteins

Slide from Mark Baker

Metrics Used by HPP Teams

› Initial 2013 definition of “missing” was “no protein level data or insufficient documentation for ID” (PE2+PE3+PE4+PE5)

› In 2014, revised to PE2+PE3+PE4, as PE5 proteins are considered dubious

Slide from Mark Baker
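The two definitions of “missing” can be compared with a few lines of arithmetic. The sketch below is only an illustration: the counts are the neXtProt PE1-PE5 numbers quoted earlier in this deck, while the dictionary layout and the function name `missing_proteins` are assumptions introduced here.

```python
# neXtProt protein existence (PE) counts as quoted in this deck.
NEXTPROT_COUNTS = {
    "2013-09-18": {"PE1": 15649, "PE2": 3576, "PE3": 198, "PE4": 94, "PE5": 635},
    "2016-02-12": {"PE1": 16518, "PE2": 2290, "PE3": 565, "PE4": 94, "PE5": 588},
}

# Which PE levels count as "missing" under each definition.
DEFINITION_2013 = ("PE2", "PE3", "PE4", "PE5")
DEFINITION_2014 = ("PE2", "PE3", "PE4")  # PE5 dropped as dubious


def missing_proteins(counts, levels):
    """Sum the entries of the given PE levels (hypothetical helper)."""
    return sum(counts[level] for level in levels)


for release, counts in NEXTPROT_COUNTS.items():
    old = missing_proteins(counts, DEFINITION_2013)
    new = missing_proteins(counts, DEFINITION_2014)
    print(f"{release}: PE2-5 = {old}, PE2-4 = {new}")
```

With these numbers the 2013 release has 4,503 missing proteins under the original definition, and the 2016 release has 2,949 under the revised one.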

A new protein existence viewer

https://search.nextprot.org/view/statistics/protein-existence

Slide from Lydie Lane

PMID 24870542

PMID 24870543

Nature papers on the draft of the Human proteome

84%

92%

1. Failure to distinguish discriminating (proteotypic) from non-discriminating peptides
2. Inclusion of many low-quality MS spectra
3. Use of short peptides (< 7 aa)
4. Use of older database builds

Testing 2014 Claims of Credible MS evidence for 108/200 ORs

Slide from Mark Baker


133 million PSMs

1 million distinct peptides

14,000 canonical proteins

0.00009 PSM FDR

0.0002 Peptide FDR

0.01 Protein FDR

Only peptides ≥ 7 AA

[Figure: protein coverage scale from 0% to 100%, marked at 70% of proteins]

Human peptides in PeptideAtlas 2014-08

Slide from Eric Deutsch
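To see why the thresholds differ so much between levels, it helps to turn each FDR into an expected number of false entries. This is a rough sketch: the counts are the rounded figures from the slide, so the results are approximate, and the names used are introduced here for illustration only.

```python
# (number of entries, FDR) per level, using the rounded figures from the slide.
PEPTIDEATLAS_2014_08 = {
    "PSM": (133_000_000, 0.00009),
    "peptide": (1_000_000, 0.0002),
    "protein": (14_000, 0.01),
}


def expected_false(n_entries, fdr):
    """Expected number of false entries at a given FDR (rounded)."""
    return round(n_entries * fdr)


for level, (n_entries, fdr) in PEPTIDEATLAS_2014_08.items():
    print(f"{level}: about {expected_false(n_entries, fdr)} false of {n_entries}")
```

Even with a PSM-level FDR of 0.00009, roughly 12,000 false PSMs remain, which is why a very strict PSM threshold is needed to keep the protein-level FDR near 1%.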


Olfactory receptor evidence in PeptideAtlas

Slide from Eric Deutsch


Only 2 of neXtProt’s 473 olfactory receptors are canonical in PeptideAtlas

Olfactory receptors in PeptideAtlas

Slide from Eric Deutsch


Which protein does the peptide implicate?

Spectrum originally identified to: GYIVAAVVK

But a better and exact match is: GYIAVAVVK

But this latter sequence is not in our reference proteome.

Which is why it was not identified correctly.

Is it olfactory receptor OR5A2? (no other corroborating evidence)
…GIVSVLVVLISYGYIVAAVVKISSATGRTKAFSTCASH…
GYIAVAVVK

Or is it serotransferrin (0.5 million PSMs)
…SDNCEDTPEAGYFAIAVVKKSASDLTWDNLKGKKS…
GYIAVAVVK

I → V: dbSNP:rs2692696 is in our reference proteome from UniProt

F → I: not in our reference proteome. Not in neXtProt.

But this protein has many SNPs, and this may be the explanation

Slide from Eric Deutsch
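The comparison on this slide can be reproduced by sliding the better-matching peptide along each candidate sequence and listing the substitutions at the closest position. This is a minimal sketch using only the sequence fragments shown on the slide; the function name `best_window` is hypothetical and the approach is a plain mismatch count, not the scoring used by any search engine.

```python
PEPTIDE = "GYIAVAVVK"  # the better and exact match for the spectrum

# Sequence fragments exactly as shown on the slide.
OR5A2_FRAGMENT = "GIVSVLVVLISYGYIVAAVVKISSATGRTKAFSTCASH"
SEROTRANSFERRIN_FRAGMENT = "SDNCEDTPEAGYFAIAVVKKSASDLTWDNLKGKKS"


def best_window(peptide, protein):
    """Return (start, window, substitutions) for the closest ungapped match.

    Each substitution is (position in peptide, residue in protein,
    residue in peptide).
    """
    n = len(peptide)
    best = None
    for start in range(len(protein) - n + 1):
        window = protein[start:start + n]
        subs = [(i, w, p) for i, (w, p) in enumerate(zip(window, peptide)) if w != p]
        if best is None or len(subs) < len(best[2]):
            best = (start, window, subs)
    return best


for name, fragment in [("OR5A2", OR5A2_FRAGMENT),
                       ("serotransferrin", SEROTRANSFERRIN_FRAGMENT)]:
    start, window, subs = best_window(PEPTIDE, fragment)
    changes = ", ".join(f"{w}{i + 1}{p}" for i, w, p in subs)
    print(f"{name}: {window} at offset {start}; substitutions: {changes}")
```

Both candidates are two residues away from the peptide: OR5A2 needs a V/A swap at positions 4 and 5, whereas serotransferrin needs F → I and I → V, the latter being the known variant mentioned on the slide.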


Q9H255 = OR51E2

But GPMdb does have this one. This is the only OR that Ron Beavis thinks is legitimate. But only observed with a single peptide (many times) (in one sample that PeptideAtlas doesn’t have)

Ron Beavis: If you check a little closer, the older gene symbol for OR51E2 is PSGR, a prostate-specific G-coupled receptor protein (Cancer Res. 2000 Dec 1;60(23):6568-72). So, I'd actually suggest that this is a true identification and that interpreting the "OR" in the gene name as being literally true is the problem.

Slide from Eric Deutsch

Growth of Human Proteome with Large Datasets from 2014-2015

Note Savitski/Kuster reanalysis of Wilhelm et al: 14,741 proteins identified, MCP 2015

Slide from Gilbert S. Omenn

PMID 27490519

Latest HPP Guideline

HUPO: MIAPE, PSI

Journals:
- Journal of Proteome Research
- Molecular and Cellular Proteomics
- Proteomics Clinical Applications

NIH-NCI: proteogenomics guideline

HPP 1.0: data deposition at ProteomeXchange; FDR at PSM, peptide and protein levels

HPP 2.0: MS data interpretation

Juan A. Vizcaíno

13th HUPO World Congress, Madrid, 5 October 2014

Manuscript detailing the process: Ternent et al., Proteomics, 2014

http://www.proteomexchange.org/submission

Example dataset:

PXD000764

- Title: “Discovery of new CSF biomarkers for meningitis in children”

- 12 runs: 4 controls and 8 infected samples

- Identification and quantification data


PX Data workflow for MS/MS data

1. Mass spectrometer output files: raw data (binary files) or peak list spectra in a standardized format (mzML, mzXML).

2. Result files:
   a. Complete submissions: Result files can be converted to PRIDE XML or the mzIdentML data standard.
   b. Partial submissions: For workflows not yet supported by PRIDE, search engine output files will be stored and provided in their original form.

3. Metadata: Sufficiently detailed description of sample origin, workflow, instrumentation, submitter.

4. Other files: Optional files:
   a. QUANT: Quantification related results
   b. PEAK: Peak list files
   c. GEL: Gel images
   d. OTHER: Any other file type
   e. FASTA
   f. SP_LIBRARY
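The file categories above suggest a simple rule for telling complete from partial submissions. The following is a minimal sketch under stated assumptions, not the actual ProteomeXchange validation logic: the `RAW` and `SEARCH` tags, the `SubmittedFile` class and the `classify_submission` function are hypothetical names introduced here, and real submissions are checked by the submission tooling itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubmittedFile:
    name: str
    # Assumed tags: RAW, RESULT, SEARCH, plus the optional types listed
    # above (QUANT, PEAK, GEL, OTHER, FASTA, SP_LIBRARY).
    file_type: str


def classify_submission(files):
    """Classify a file set as 'complete', 'partial' or 'invalid' (sketch)."""
    types = {f.file_type for f in files}
    # Item 1: mass spectrometer output is needed, as raw data or peak lists.
    if not types & {"RAW", "PEAK"}:
        return "invalid"
    # Item 2a: results in PRIDE XML or mzIdentML make the submission complete.
    if "RESULT" in types:
        return "complete"
    # Item 2b: only original search engine output makes it partial.
    if "SEARCH" in types:
        return "partial"
    return "invalid"


example = [
    SubmittedFile("run01.raw", "RAW"),
    SubmittedFile("run01.mzid", "RESULT"),
    SubmittedFile("human.fasta", "FASTA"),
]
print(classify_submission(example))  # complete
```

Metadata (item 3) is deliberately left out of the sketch, since its required fields are not spelled out on the slide.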


Complete vs Partial submissions: experimental metadata

Complete Partial

General experimental metadata about the projects is similar.

However, assay-level information in partial submissions is less detailed.


Complete vs Partial submissions: processed results

[Figure panels: Complete, Partial]

For complete submissions, it is possible to connect the spectra with the processed identification results, and they can be visualized.


Complete submissions using mzIdentML

[Figure labels: Search engine results + MS files; Search engines; mzIdentML]

- Mascot

- MSGF+

- Myrimatch and related tools from D. Tabb’s lab

- OpenMS

- PEAKS

- ProCon (ProteomeDiscoverer, Sequest)

- Scaffold

- TPP via the idConvert tool (ProteoWizard)

- ProteinPilot (planned by the end of 2014)

- Others: library for X!Tandem conversion, lab internal pipelines, …

An increasing number of tools support export to mzIdentML 1.1

- Referenced spectral files need to be submitted as well (all open formats are supported).

Updated list: http://www.psidev.info/tools-implementing-mzIdentML#.
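Because the referenced spectral files must accompany an mzIdentML file, a pre-submission check can list the spectra files it points to and flag any that are absent. This is a sketch, assuming the files are referenced through the `location` attribute of `SpectraData` elements as in mzIdentML 1.1; the function names and the example document are made up for illustration, and only the file name (not the full path) is compared.

```python
import xml.etree.ElementTree as ET
from pathlib import PurePosixPath
from urllib.parse import unquote, urlparse

# Made-up, heavily truncated mzIdentML document for illustration only.
EXAMPLE_MZID = """
<MzIdentML xmlns="http://psidev.info/psi/pi/mzIdentML/1.1">
  <DataCollection>
    <Inputs>
      <SpectraData id="SD_1" location="file:///data/run01.mzML"/>
      <SpectraData id="SD_2" location="run02.mgf"/>
    </Inputs>
  </DataCollection>
</MzIdentML>
"""


def referenced_spectra_files(mzid_xml):
    """Return the file names referenced by SpectraData elements."""
    root = ET.fromstring(mzid_xml.strip())
    names = set()
    for elem in root.iter():
        # Compare the local tag name so the check ignores the namespace.
        if elem.tag.rsplit("}", 1)[-1] == "SpectraData":
            location = elem.get("location", "")
            path = unquote(urlparse(location).path) or location
            names.add(PurePosixPath(path).name)
    return names


def missing_spectra_files(mzid_xml, submitted_names):
    """Spectra files referenced in the mzIdentML but not in the submission."""
    return referenced_spectra_files(mzid_xml) - set(submitted_names)


print(missing_spectra_files(EXAMPLE_MZID, ["run01.mzML"]))  # {'run02.mgf'}
```

Windows-style paths in `location` are not handled here and would need extra care.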


[Figure: ‘RESULT’ file generation. Column headings: Tools; ‘RESULT’ file generation; Final ‘RESULT’ file. Tools shown: Mascot, ProteinPilot, Scaffold, PEAKS, MSGF+, Others. Labels: Now: native file export; mzIdentML ‘RESULT’; Spectra files.]

FDR accumulation when combining datasets
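One common way to picture this effect is a toy model in which every dataset recovers the same true proteins, while its false positives are random and therefore do not overlap between datasets. The sketch below works under exactly those simplifying assumptions; the numbers and the function name `combined_fdr` are hypothetical and are not taken from the slide.

```python
def combined_fdr(n_datasets, ids_per_dataset, fdr_per_dataset):
    """Protein FDR of the union of datasets under a toy model.

    Assumptions (illustrative only): every dataset reports the same set
    of true proteins, and false positives never coincide across datasets.
    """
    false_each = ids_per_dataset * fdr_per_dataset
    true_shared = ids_per_dataset - false_each
    false_total = n_datasets * false_each
    return false_total / (true_shared + false_total)


# Hypothetical example: 10,000 protein identifications per dataset at 1% FDR.
for n in (1, 2, 5, 10, 50):
    print(f"{n:>2} datasets: combined FDR = {combined_fdr(n, 10_000, 0.01):.3f}")
```

In this model ten datasets that each pass a 1% protein FDR already give a combined list with about 9% false identifications, which is why the FDR has to be re-estimated after merging rather than inherited from the individual datasets.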

Manual Inspection of Extraordinary Claims

› Reviewers and readers (and authors) need to see this:

Slide from Eric Deutsch

Manual Inspection of Extraordinary Claims

› Reviewers and readers should not see this:

› This is what false positives look like

Slide from Eric Deutsch

Questions!

Acknowledgement of all collaborators and members of (C-)HPP participating in C-HPP workshops and HUPO meetings

Thank you for your attention!