Do we need to make public our proteomics data?

Do we need to make public our proteomics data? Dr. Yasset Perez-Riverol Twitter: @ypriverol Github: ypriverol Bioinformatician - PRIDE Group Proteomics Services Team EMBL-EBI Hinxton, Cambridge, UK

Do we need to make public our proteomics data?

Dr. Yasset Perez-Riverol

Twitter: @ypriverol

Github: ypriverol

Bioinformatician - PRIDE Group

Proteomics Services Team

EMBL-EBI

Hinxton, Cambridge, UK

Yasset Perez-Riverol [email protected]

BRPROT 2014, Búzios, Brazil (Dec 7-10, 2014)

I believe in open data, source, access …

An Integrated, Directed Mass Spectrometric Approach for In-depth Characterization of Complex Peptide Mixtures. Mol Cell Proteomics. Nov 2008; 7(11): 2138–2150

1 dataset (no cost) => 4 papers and 3 new algorithms

I believe in open data, source, access policies…

Overview

• Proteomics data deposition, bad practices and experiences.

• PRIDE and ProteomeXchange

• PRIDE Components.

• Ongoing and Future work!

Proteomics data deposition, bad practices, experiences.

Protein Expression Databases

Processed Data

RAW Data

ProteomeXchange Consortium

• Goal: Development of a framework to allow standard data submission and dissemination pipelines between the main existing proteomics repositories.

• Includes PeptideAtlas (ISB, Seattle), PRIDE (Cambridge, UK) and MassIVE (UCSD, San Diego).

• Common identifier space (PXD identifiers)

• Two supported data workflows: MS/MS and SRM.

http://www.proteomexchange.org
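
The common identifier space can be illustrated with a small helper that recognises PXD accessions in free text. This is only a sketch: the pattern ("PXD" followed by six digits) is an assumption inferred from the accessions quoted in this talk, such as PXD000561, and the function names are made up for the example.

```python
import re
from typing import List

# Assumed accession shape: "PXD" plus six digits, inferred from the
# examples quoted in the slides (e.g. PXD000561). Not an official spec.
PXD_PATTERN = re.compile(r"PXD\d{6}")


def is_pxd_accession(text: str) -> bool:
    """Return True if the whole string looks like a PXD accession."""
    return PXD_PATTERN.fullmatch(text.strip()) is not None


def find_pxd_accessions(text: str) -> List[str]:
    """Return every PXD-like accession found in a piece of text."""
    return PXD_PATTERN.findall(text)


if __name__ == "__main__":
    print(is_pxd_accession("PXD000561"))
    print(find_pxd_accessions("Data are available as PXD000561 and PXD000865."))
```

Such a check is handy when scanning manuscripts or spreadsheets for dataset references before looking them up in ProteomeCentral.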

ProteomeXchange data workflow

[Figure: data workflow diagram. Submitted items: metadata / manuscript, raw data*, results. Receiving repositories: PRIDE (MS/MS data), MassIVE (MS/MS data), PASSEL (SRM data). Central hub: ProteomeCentral. Other labels: researcher's results, reprocessed results, raw data*, metadata, journals, UniProt/neXtProt, PeptideAtlas, GPMDB, other DBs.]

Vizcaíno et al., Nat Biotechnol, 2014

ProteomeXchange Partners: MassIVE (UCSD)

http://proteomics.ucsd.edu/service/massive/

• Joined ProteomeXchange in June 2014

ProteomeXchange Partners: PASSEL for SRM data

• Suitable for SRM assays

• Part of the PeptideAtlas set of resources.

http://www.peptideatlas.org/passel/

Farrah et al., Proteomics, 2012

ProteomeXchange Partners: PRIDE

Vizcaíno et al., Nucleic Acids Research, 2014

http://www.ebi.ac.uk/pride/archive/

Current status of databases & repositories.

[Figure (Perez-Riverol Y, et al. Proteomics. 2014): resources grouped under "Protein resources", "Protein Expression Databases" and "Processed Data & RAW Data". Names shown: PRIDE, PASSEL, Chorus, MassIVE, PeptideAtlas, GPMDB, proteomicsDB, PaxDb, Human Proteinpedia, MaxQB, PRIDE, PASSEL, Human Proteome Map, MOPED, UniProt, neXtProt.]

Data rescue

PX Submission workflow for MS/MS data

1. Mass spectrometer output files: raw data (binary files) or peak list spectra in a standardized format (mzML, mzXML).

2. Result files:

a. Partial submissions: For workflows not yet supported by PRIDE, search engine output files will be stored and provided in their original form.

b. Complete submissions: Result files can be converted to PRIDE XML or the mzIdentML data standard.

3. Metadata: Sufficiently detailed description of sample origin, workflow, instrumentation and submitter, based on ontologies and controlled vocabularies.

4. Other files (optional):

a. QUANT: Quantification-related results

b. PEAK: Peak list files

c. OTHER: Any other file type

e. FASTA

[Figure labels: Published, RawFiles, Other files]

Ternent et al., Proteomics, 2014
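
The difference between partial and complete submissions can be sketched in a few lines of Python. This is an illustration of the rules on the slide, not the logic of the real submission tool: the labels `ms_output`, `standard_result` and `search_engine_output` are hypothetical names chosen for this example.

```python
from typing import Iterable

# Hypothetical labels for the kinds of file a submitter provides:
#   "ms_output"            raw files or mzML/mzXML peak lists (item 1)
#   "standard_result"      results in PRIDE XML or mzIdentML (item 2b)
#   "search_engine_output" results in the search engine's own format (item 2a)


def classify_submission(file_kinds: Iterable[str]) -> str:
    """Classify a submission as 'complete', 'partial' or 'incomplete'."""
    kinds = set(file_kinds)
    if "ms_output" not in kinds:
        # Mass spectrometer output is expected in every submission.
        return "incomplete"
    if "standard_result" in kinds:
        return "complete"
    if "search_engine_output" in kinds:
        return "partial"
    return "incomplete"


if __name__ == "__main__":
    print(classify_submission(["ms_output", "standard_result"]))
    print(classify_submission(["ms_output", "search_engine_output"]))
```

Metadata and the optional QUANT, PEAK, FASTA and OTHER files are left out of the sketch because they do not change the partial/complete distinction described above.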

Complete submissions (mzIdentML)

[Figure labels: Search engine results + MS files; Search engines; mzIdentML]

An increasing number of tools support export to mzIdentML 1.1:

- Mascot

- MSGF+

- MyriMatch and related tools from D. Tabb’s lab

- OpenMS

- PEAKS

- ProCon (ProteomeDiscoverer, Sequest)

- Scaffold

- TPP via the idConvert tool (ProteoWizard)

- ProteinPilot (planned by the end of 2014)

- Others: library for X!Tandem conversion, lab internal pipelines, …

Referenced spectral files need to be submitted as well (all open formats are supported).

Updated list: http://www.psidev.info/tools-implementing-mzIdentML#.
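
Because the referenced spectral files must travel with the mzIdentML file, it can help to list them before uploading. The sketch below uses only the Python standard library and reads the `location` attribute of each `SpectraData` element. It matches elements by local name so that it does not depend on one schema version. The helper names are invented for this example, and this is not how the PX submission tool itself performs the check.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, List, Union


def referenced_spectra_files(mzid_xml: Union[str, bytes]) -> List[str]:
    """Return the file names referenced by SpectraData elements."""
    root = ET.fromstring(mzid_xml)
    names = []
    for element in root.iter():
        # Compare the local name only, ignoring the XML namespace.
        if element.tag.rsplit("}", 1)[-1] != "SpectraData":
            continue
        location = element.get("location", "")
        # Keep the last path component, for both / and \ separators.
        names.append(location.replace("\\", "/").rsplit("/", 1)[-1])
    return names


def missing_spectra_files(mzid_xml: Union[str, bytes],
                          submitted: Iterable[str]) -> List[str]:
    """Return referenced spectra files that are not among the submitted names."""
    available = {name.lower() for name in submitted}
    return [name for name in referenced_spectra_files(mzid_xml)
            if name.lower() not in available]
```

A typical call would read the file with `pathlib.Path("results.mzid").read_bytes()` (a placeholder file name) and compare the outcome with the list of files selected for upload.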

Universal file format (mzTab)

http://mztab.googlecode.com

Metadata
• Basic information about experiment and sample
• Key-Value pairs

Protein
• Basic information about protein identifications
• Table-based

Peptide
• Information about quantified peptides
• Table-based

PSM
• Information about identified spectra
• Table-based

Small Molecule
• Basic information about identified small molecules
• Table-based

J. Griss et al., MCP, 2014
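
Since mzTab is a tab-separated text format, the sections listed above can be read with very little code. The sketch below assumes the three-letter line prefixes of mzTab 1.0 (MTD for metadata; PRH/PRT, PEH/PEP, PSH/PSM and SMH/SML for the header and data rows of each table). The function name and the returned structure are choices made for this example, and comment lines or unknown prefixes are simply skipped.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Line prefixes as used in mzTab 1.0: header rows and data rows per section.
HEADER_PREFIXES = {"PRH": "protein", "PEH": "peptide",
                   "PSH": "psm", "SMH": "small_molecule"}
ROW_PREFIXES = {"PRT": "protein", "PEP": "peptide",
                "PSM": "psm", "SML": "small_molecule"}


def parse_mztab(lines: Iterable[str]) -> Tuple[Dict[str, str], Dict[str, List[dict]]]:
    """Split mzTab lines into key-value metadata and table rows per section."""
    metadata: Dict[str, str] = {}
    headers: Dict[str, List[str]] = {}
    rows: Dict[str, List[dict]] = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        prefix = fields[0]
        if prefix == "MTD" and len(fields) >= 3:
            metadata[fields[1]] = fields[2]
        elif prefix in HEADER_PREFIXES:
            headers[HEADER_PREFIXES[prefix]] = fields[1:]
        elif prefix in ROW_PREFIXES:
            section = ROW_PREFIXES[prefix]
            # Pair each value with the column name from the section header.
            rows[section].append(dict(zip(headers.get(section, []), fields[1:])))
    return metadata, dict(rows)
```

Passing an open file object works because iterating over it yields lines; the metadata comes back as a plain dictionary and each table as a list of dictionaries keyed by column name.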

PRIDE Components: Submission Process

PRIDE Converter, PRIDE Inspector, PX Submission Tool

PRIDE Components: PX submission tool

• Capture the mappings between the different types of files.

• Add the mandatory metadata annotation.

• Make the file upload process straightforward for the submitter (it transfers all the files using Aspera or FTP).

• Command line alternative: some scripting is needed.

[Figure labels: Published, Raw, Other files; PX submission tool]

http://www.proteomexchange.org/submission

PRIDE Inspector 2.0

Available for complete submissions

PRIDE Inspector 2.0 supports:

- PRIDE XML

- mzIdentML + all types of spectra files

- mzML

- mzTab Quantitation (work in progress)

Wang et al., Nat. Biotechnology, 2012

https://github.com/PRIDE-Toolsuite/

PRIDE Components: Services & Web components

ProteomeXchange: 1329 datasets up until October 2014

Origin:

293 USA

184 Germany

143 UK

83 France

82 Netherlands

78 China

62 Switzerland

46 Spain

45 Belgium

45 Canada

42 Denmark

37 Australia

37 Japan

34 Sweden

26 Austria

21 Brazil

21 Taiwan

21 India

20 Norway

19 Finland

17 Ireland

14 Italy

12 Republic of Korea

9 Singapore

8 Israel

8 Russia

Type:

437 PRIDE complete

792 PRIDE partial

63 PeptideAtlas/PASSEL complete

14 MassIVE

23 reprocessed

Publicly Accessible:

691 datasets, 52% of all

86% PRIDE

12% PASSEL

2% MassIVE

Data volume:

Total: ~55 TB

Number of all files: ~131,000

PXD000320-324: ~ 5 TB

PXD000065: ~ 1.4TB

Top Species studied by at least 10 datasets:

577 Homo sapiens

165 Mus musculus

56 Saccharomyces cerevisiae

53 Arabidopsis thaliana

29 Rattus norvegicus

22 Escherichia coli

17 Bos taurus

16 Mycobacterium tuberculosis

13 Oryza sativa

13 Drosophila melanogaster

13 Glycine max

~ 290 species in total

Datasets/year:

2012: 102

2013: 527

2014: 700
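
The figures on this slide are internally consistent, which can be checked with a few lines of arithmetic. The numbers below are copied from the slide; only the variable names are new.

```python
# Dataset counts by submission type, as listed on the slide.
datasets_by_type = {
    "PRIDE complete": 437,
    "PRIDE partial": 792,
    "PeptideAtlas/PASSEL complete": 63,
    "MassIVE": 14,
    "reprocessed": 23,
}

# Dataset counts by year, as listed on the slide.
datasets_by_year = {2012: 102, 2013: 527, 2014: 700}

public_datasets = 691

total = sum(datasets_by_type.values())
public_share = round(100 * public_datasets / total)

print(total)                           # 1329
print(sum(datasets_by_year.values()))  # 1329
print(public_share)                    # 52
```

Both breakdowns add up to the 1329 datasets quoted in the title, and 691 public datasets correspond to the stated 52%.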

Brazil Submissions:

21 Projects

11 PXD Public

10 PXD Private

Main Contributors:

Martins-de-Souza D. PhD (6)

Domont G. Prof (4)

Carvalho PC. (4)

Journals and Data Deposition

[Figure: number of submissions per journal]

Data Access?

[Figure axis label: Total numbers]

PXD Identifier    Hits      Dataset title

PXD000561         153512    A draft map of the human proteome

PXD000865         51639     Mass spectrometry based draft of the human proteome

Ongoing and future work.

• Quality assessment of complete submissions.

• Make the data available and reusable.

• Integration of different Protein expression resources

• PRIDE

• PeptideAtlas

• ProteomicsDB

• Human Proteome Map

QC with PRIDE Inspector

QC PRIDE Inspector and Quantitation

Validation of controversial data

• Analysis of Tyrannosaurus rex fossils: controversial presence of collagen (is it a contamination of the sample?)

Asara et al. (2007) Science 316: 280-5.

Asara et al. (2007) Science 316: 1324-5.

Bern et al. (2009) JPR 9: 4328-32

PRIDE assay accession 8633

Quality control: PRIDE Cluster

• Data integration across many experiments before filtering

• Assumption: The same peptide will generate the same MS/MS spectrum in many experiments

• Cluster all spectra in PRIDE

• Those clusters which contain only/mainly one peptide are considered reliable

[Figure labels: NMMAACDPR, NMMAACDPR, PPECPDFDPPR, NMMAACDPR, Consensus, PPECPDFDPPR]

Griss et al., Nature Met. 2012
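
The reliability idea ("only/mainly one peptide per cluster") can be expressed as a simple ratio. The sketch below is not the PRIDE Cluster implementation: it assumes the spectra have already been clustered and only looks at the peptide identifications attached to one cluster. The 0.7 threshold and the example counts are arbitrary values chosen for illustration.

```python
from collections import Counter
from typing import Sequence, Tuple


def dominant_peptide(identifications: Sequence[str]) -> Tuple[str, float]:
    """Return the most frequent peptide in a cluster and its share of spectra."""
    if not identifications:
        raise ValueError("a cluster needs at least one identified spectrum")
    peptide, count = Counter(identifications).most_common(1)[0]
    return peptide, count / len(identifications)


def is_reliable(identifications: Sequence[str], min_ratio: float = 0.7) -> bool:
    """Treat a cluster as reliable if one peptide accounts for most spectra.

    min_ratio is a hypothetical cut-off, not a value taken from the talk.
    """
    _, ratio = dominant_peptide(identifications)
    return ratio >= min_ratio


if __name__ == "__main__":
    cluster = ["NMMAACDPR", "NMMAACDPR", "PPECPDFDPPR", "NMMAACDPR"]
    print(dominant_peptide(cluster))
    print(is_reliable(cluster))
```

With three of four spectra assigned to NMMAACDPR, the example cluster has a ratio of 0.75 and passes the assumed threshold; a cluster split evenly between two peptides would not.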

PRIDE Cluster

Spectral libraries

Sneak peek of the new PRIDE Cluster web

Make data available and reusable.

• Around 60% of the data used for the analysis comes from previous experiments, most of them stored in proteomics repositories such as PRIDE/ProteomeXchange, PASSEL or MassIVE.

Perez-Riverol Y, et al. Proteomics. 2014

CompOmics Open Source Analysis Pipeline

Vaudel M, Barsnes H, Berven FS, Sickmann A, Martens L: Proteomics 2011;11(5):996-9.

http://searchgui.googlecode.com

Vaudel M, Burkhart J, Zahedi RP, Berven FS, Sickmann A, Martens L, Barsnes H: Nature Biotechnology (in press)

http://peptide-shaker.googlecode.com

Reshake PRIDE data!

Find the desired PRIDE project …

… inspect the project details …

… and start re-analyzing the data!

Integration of different Protein expression resources

[Figure labels: PROXI clients; web services (PROXI); registry; repositories & databases; data]

Perez-Riverol Y, Proteomics, 2014

Conclusions

• ProteomeXchange is widely used.

• PRIDE contains most of the MS/MS datasets.

• It now has a new consortium member: MassIVE (UCSD).

• Around half of the datasets are already public.

• Different open source tools are available to facilitate the process.

• File transfer speed should not be a problem (Aspera support).

• Data deposition enables and promotes data reuse.

• ProteomeXchange is open to new members.

Acknowledgements

PRIDE Team

Juan A. Vizcaino (Group Leader)

Attila Csordas

Rui Wang

Florian Reisinger

Jose A. Dianes

Tobias Ternent

Noemi del Toro

Henning Hermjakob

PeptideAtlas Team (ISB, Seattle)

Eric Deutsch

Terry Farrah

Zhi Sun

MassIVE

Nuno Bandeira

And many other PX partners and stakeholders

Questions?