PRIDE and ProteomeXchange: supporting the cultural change in proteomics public data deposition

EMBL-EBI Now and in the Future

Dr. Juan Antonio Vizcaíno

Proteomics Team Leader, EMBL-EBI, Hinxton, Cambridge, UK

Summer School 2016, Dagstuhl, 27 September 2016

Data resources at EMBL-EBI

- Genes, genomes & variation: European Nucleotide Archive, European Variation Archive, European Genome-phenome Archive, Ensembl, Ensembl Genomes, GWAS Catalog, Metagenomics portal
- Gene, protein & metabolite expression: ArrayExpress, Expression Atlas, MetaboLights, PRIDE
- Protein sequences, families & motifs: InterPro, Pfam, UniProt
- Molecular structures: Protein Data Bank in Europe, Electron Microscopy Data Bank
- Chemical biology: ChEMBL, ChEBI
- Reactions, interactions & pathways: IntAct, Reactome, MetaboLights
- Systems: BioModels, Enzyme Portal, BioSamples
- Literature & ontologies: Europe PubMed Central, Gene Ontology, Experimental Factor Ontology

The slide shows the core resources at EMBL-EBI and the range of data you can access through them.

Overview

- PRIDE and ProteomeXchange
- How to submit data to PRIDE: PRIDE tools
- How to access data in PRIDE Archive
- Some examples of public data reuse

What is a proteomics publication in 2016?

Proteomics studies generate potentially large amounts of data and results.

Ideally, a proteomics publication needs to:
- Summarize the results of the study
- Provide supporting information for reliability of any results reported

Information in a publication:
- Manuscript
- Supplementary material
- Associated data submitted to a public repository


PRIDE (PRoteomics IDEntifications) database

http://www.ebi.ac.uk/pride/archive

PRIDE stores mass spectrometry (MS)-based proteomics data:
- Peptide and protein expression data (identification and quantification)
- Post-translational modifications
- Mass spectra (raw data and peak lists)
- Technical and biological metadata
- Any other related information

Full support for tandem MS approaches.

Martens et al., Proteomics, 2005
Vizcaíno et al., NAR, 2016

Journal Submission Recommendations

Journal guidelines recommend submission to proteomics repositories:
- Proteomics (dataset briefs)
- JPR (HPP papers)
- Molecular and Cellular Proteomics
- Journals from the Nature group
- Journals from the PLOS group

Funding agencies are enforcing public deposition of data to maximize the value of the funds provided.

PRIDE: Source of MS proteomics data

PRIDE Archive already provides or will soon provide MS proteomics data to other EMBL-EBI resources such as UniProt, Ensembl and the Expression Atlas.

http://www.ebi.ac.uk/pride

PRIDE is working in two main directions:
- Developing submission and dissemination pipelines for MS proteomics data that involve the main proteomics resources (the ProteomeXchange consortium).
- Integrating proteomics information (peptide/protein expression data) with other EBI resources such as Ensembl (genomics), the Expression Atlas (transcriptomics) and UniProt (protein sequence information).

Proteomics data is needed to have a more complete picture of biology.

ProteomeXchange: A Global, distributed proteomics database

Repositories:
- PRIDE (MS/MS data)
- MassIVE (MS/MS data)
- jPOST (MS/MS data), new in 2016
- PASSEL (SRM data)

Data types: raw data (Raw), identification and quantification results (ID/Q), metadata (Meta).

Mandatory raw data deposition since July 2015.

Goal: Development of a framework to allow standard data submission and dissemination pipelines between the main existing proteomics repositories.

http://www.proteomexchange.org
Vizcaíno et al., Nat Biotechnol, 2014

ProteomeXchange data workflow

[Figure: ProteomeXchange data workflow. Elements shown: researchers' results, raw data and metadata, submitted as datasets; receiving repositories PRIDE, MassIVE and jPOST (MS/MS data as complete submissions; any other workflow mainly as partial submissions) and PASSEL (SRM data); ProteomeCentral (metadata / manuscript, raw data, results); research groups (reanalysis of datasets); reprocessed results in PeptideAtlas, GPMDB and proteomicsDB; UniProt/neXtProt and other DBs; journals; OmicsDI (integration with other omics datasets).]

ProteomeXchange: 4,534 datasets up until 31st July, 2016

Type:
- 4067 PRIDE (~90%)
- 339 MassIVE
- 115 PeptideAtlas/PASSEL
- 13 jPOST

Publicly accessible: 2597 datasets, 57% of all
- 2334 PRIDE
- 135 MassIVE
- 115 PASSEL
- 13 jPOST

Datasets/year:
- 2012: 102
- 2013: 527
- 2014: 963
- 2015: 1758
- 2016 (till end of July): 1184

Countries with at least 100 datasets:
- 1105 USA
- 546 Germany
- 411 United Kingdom
- 356 China
- 229 France
- 188 Netherlands
- 178 Canada
- 150 Switzerland
- 125 Australia
- 123 Spain
- 123 Denmark
- 117 Japan
- 101 Sweden

Top species studied by at least 100 datasets:
- 2010 Homo sapiens
- 604 Mus musculus
- 191 Saccharomyces cerevisiae
- 140 Arabidopsis thaliana
- 127 Rattus norvegicus

936 reported taxa in total.
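The figures on this slide can be cross-checked with a few lines of arithmetic. The sketch below simply re-enters the numbers from the slide (the variable names are illustrative, not part of any PRIDE tool) and confirms that the per-repository and per-year counts add up to the stated total and that the quoted percentages follow from them.

```python
# Figures copied from the slide (ProteomeXchange, up to 31 July 2016).
total = 4534
by_repository = {"PRIDE": 4067, "MassIVE": 339, "PeptideAtlas/PASSEL": 115, "jPOST": 13}
public_by_repository = {"PRIDE": 2334, "MassIVE": 135, "PASSEL": 115, "jPOST": 13}
by_year = {2012: 102, 2013: 527, 2014: 963, 2015: 1758, 2016: 1184}


def share(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


pride_share = share(by_repository["PRIDE"], total)
public_total = sum(public_by_repository.values())
public_share = share(public_total, total)

print(f"Sum over repositories: {sum(by_repository.values())}")
print(f"Sum over years: {sum(by_year.values())}")
print(f"PRIDE share of all datasets: {pride_share:.1f}%")
print(f"Public datasets: {public_total} ({public_share:.1f}% of all)")
```

Both sums reproduce the total of 4,534 datasets, so the breakdowns on the slide are internally consistent.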


PRIDE Archive (in the context of ProteomeXchange and the PSI standards)





Complete vs Partial submissions: processed results

For complete submissions, it is possible to connect the spectra with the identification results (the results can be parsed), and they can be visualized.

Complete vs Partial submissions: experimental metadata

General experimental metadata about the projects is similar. However, at the assay level, the information in partial submissions is less detailed.

Partial submissions can be used to store other data types

Everything can be stored, not only MS/MS data: a very flexible mechanism to capture all types of datasets.

- Top down proteomics datasets.
- Mass Spectrometry Imaging datasets.
- Data independent acquisition techniques: e.g. SWATH-MS datasets.

PRIDE does not store SRM data (it goes to PASSEL).

How to perform a complete PX submission to PRIDE

1. Decide between a complete/partial submission.
2. File conversion/export to mzIdentML (or PRIDE XML).
3. File check before submission (PRIDE Inspector).
4. Experimental annotation and actual file submission (PX submission tool).
5. Post-submission steps.



PX Data workflow for MS/MS data

- Mass spectrometer output files: raw data (binary files) or peak list spectra in a standardized format (mzML, mzXML).
- Result files:
  - Complete submissions: Result files can be converted to the mzIdentML data standard (also PRIDE XML).
  - Partial submissions: For workflows not yet supported by PRIDE, search engine output files will be stored and provided in their original form.
- Metadata: Sufficiently detailed description of sample origin, workflow, instrumentation, submitter.
- Other files: Optional files (the list can be extended):
  - QUANT: Quantification related results
  - PEAK: Peak list files
  - GEL: Gel images
  - OTHER: Any other file type
  - FASTA
  - SP_LIBRARY

PRIDE Components: Submission Process

- PRIDE Converter 2
- PRIDE Inspector
- PX Submission Tool

Step 1: RESULT file generation (mzIdentML or PRIDE XML)

Tools

- Final RESULT file: mzIdentML.
- Now: native file export to mzIdentML from Mascot, ProteinPilot, Scaffold, PEAKS, MSGF+, PLGS and others.
- Spectra files: mzML, mzXML, mzData, mgf, pkl, ms2, dta, apl.

Complete submissions

Search engine results + MS files: search engines that export to mzIdentML

- Mascot
- MSGF+
- MyriMatch and related tools from D. Tabb's lab
- OpenMS
- PEAKS
- PeptideShaker
- ProCon (ProteomeDiscoverer, Sequest)
- Scaffold
- TPP via the idConvert tool (ProteoWizard)
- ProteinPilot (from version 5.0)
- X!Tandem native conversion (Beta, PILEDRIVER)
- Others: library for X!Tandem conversion, lab internal pipelines, Crux

An increasing number of tools support export to mzIdentML 1.1.

Referenced spectral files need to be submitted as well (all open formats are supported).

Updated list: http://www.psidev.info/tools-implementing-mzIdentML#

Complete submissions: which tools are missing

- MaxQuant: Export to mzTab (work in progress)
- Proteome Discoverer (Thermo): Work in progress

An increasing number of tools support export to mzIdentML 1.1.

Updated list: http://www.psidev.info/tools-implementing-mzIdentML#


Step 2: PRIDE Inspector Toolsuite

- PRIDE Inspector: standalone tool to enable visualisation and validation of MS data.
- Built on top of ms-data-core-api: open source algorithms and libraries for computational proteomics.
- Supported file formats: mzIdentML, mzML, mzTab (PSI standards), and PRIDE XML.
- Broad functionality.

https://github.com/PRIDE-Utilities/ms-data-core-api
https://github.com/PRIDE-Toolsuite/pride-inspector

Wang et al., Nat. Biotechnology, 2012
Perez-Riverol et al., Bioinformatics, 2015
Perez-Riverol et al., MCP, 2016

PRIDE Inspector Functionality

- Summary and QC charts (Delta m/z, precursor charges, etc.)
- Spectra view: peptide spectra annotation and visualisation (fragmentation table, ion series annotation)
- Protein inference algorithm and protein groups visualisation
- Protein view containing protein inference information
- Quantification view
- Multiple export options (.mgf, protein/peptide tables, mzTab file)
- Direct access to PRIDE datasets


Step 3: PX submission tool

- Captures the mappings between the different types of files.
- Makes the file upload process straightforward for the submitter (it transfers all the files using Aspera or FTP).
- Command line alternative: using the Aspera file transfer protocol.

http://www.proteomexchange.org/submission

PX submission tool: screenshots

Fast file transfer with Aspera

- Aspera is the default file transfer protocol to PRIDE, both in the PX Submission tool and from the command line.
- Up to 50X faster than FTP.

File transfer speed should not be a problem!

Manuscript published detailing the process

Ternent et al., Proteomics, 2014
http://www.proteomexchange.org/submission

Example dataset: PXD000764
- Title: Discovery of new CSF biomarkers for meningitis in children
- 12 runs: 4 controls and 8 infected samples
- Identification and quantification data

Public data release: when does it happen?

- When the author tells us to do it (the authors can also do it by themselves).
- When we find out that a dataset has been published: we look for PXD identifiers in PubMed abstracts.

If your PXD identifier is not in the abstract, a paper may have been published while the data is still private. Let us know!

There is a new web form on the PRIDE website to facilitate the process.
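As an illustration of how PXD identifiers could be spotted in an abstract, here is a minimal sketch using a regular expression. It assumes identifiers have the form "PXD" followed by six digits, as in the example PXD000764; the function name and the sample abstract are made up for the example, and this is not the procedure PRIDE actually runs.

```python
import re

# Assumption: identifiers look like "PXD" followed by six digits (e.g. PXD000764).
PXD_PATTERN = re.compile(r"\bPXD\d{6}\b")


def find_pxd_identifiers(abstract):
    """Return the distinct PXD identifiers in the text, in order of first appearance."""
    found = []
    for identifier in PXD_PATTERN.findall(abstract):
        if identifier not in found:
            found.append(identifier)
    return found


# Hypothetical abstract text, written only for this example.
example_abstract = (
    "The data have been deposited to ProteomeXchange with identifier PXD000764. "
    "We also reanalysed PXD000561 and compared it with PXD000764."
)

print(find_pxd_identifiers(example_abstract))
```

An abstract that does not mention the identifier returns an empty list, which is exactly the case where the dataset may stay private after publication.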





PRIDE Cluster and PRIDE Proteomes


ProteomeCentral: Centralised portal for all PX datasets

http://proteomecentral.proteomexchange.org/cgi/GetDataset

RSS and Twitter feeds for public datasets

- RSS: http://groups.google.com/group/proteomexchange/feed/rss_v2_0_msgs.xml
- Twitter: @proteomexchange

Ways to access data in PRIDE Archive

- PRIDE web interface
- File repository
- REST web service
- PRIDE Inspector tool

CompOmics Open Source Analysis Pipeline

- SearchGUI: https://github.com/compomics/searchgui
  Vaudel M, Barsnes H, Berven FS, Sickmann A, Martens L: Proteomics 2011;11(5):996-9.
- PeptideShaker: https://github.com/compomics/peptide-shaker
  Vaudel M, Burkhart J, Zahedi RP, Berven FS, Sickmann A, Martens L, Barsnes H: Nature Biotechnology 2015; 33(1):22-24.

Find the desired PRIDE project, inspect the project details and start re-analyzing the data. Reshake PRIDE data!






Datasets are being reused more and more.

Data download volume for PRIDE in 2015: ~ 200 TB

Vaudel et al., Proteomics, 2016

Challenges for data reuse in proteomics

- Insufficient technical and biological metadata.
- Large computational infrastructure may be needed (e.g. when analysing many datasets together).
- Shortage of expertise (people).
- Lack of standardisation in the field.

Data sharing in Proteomics


PRIDE Cluster

- Provides an aggregated, peptide-centric view of PRIDE Archive.
- Hypothesis: the same peptide will generate similar MS/MS spectra across experiments.
- New spectral clustering algorithm to reliably group spectra coming from the same peptide.
- Infers reliable identifications by comparing the submitted identifications of spectra within a cluster.

After clustering, a representative spectrum is built for all peptides consistently identified across different datasets.

Griss et al., Nat. Methods, 2013
Griss et al., Nat. Methods, 2016
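The idea of comparing submitted identifications within a cluster can be illustrated with a small sketch. This is not the PRIDE Cluster algorithm: it only counts how often each peptide sequence was reported for the spectra in one cluster and returns the most frequent one with its share. The function name, the peptide sequences and the cluster contents are hypothetical.

```python
from collections import Counter


def consensus_identification(peptide_ids):
    """Return the most frequent peptide in a cluster and the fraction of spectra reporting it."""
    counts = Counter(peptide_ids)
    peptide, n_spectra = counts.most_common(1)[0]
    return peptide, n_spectra / len(peptide_ids)


# Hypothetical cluster: ten spectra, eight submitted with one sequence, two with another.
cluster = ["ELVISLIVESK"] * 8 + ["ELVISLLVESK"] * 2

peptide, ratio = consensus_identification(cluster)
print(f"Most frequent identification: {peptide} ({ratio:.0%} of spectra)")
```

A cluster in which nearly all spectra agree on one sequence supports that identification, while a low ratio flags the cluster for closer inspection. Any cut-off applied to the ratio would be a separate design decision not covered here.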

Examples: one perfect cluster

- 880 PSMs give the same peptide ID
- 4 species
- 28 datasets
- Same instruments

Examples: one perfect cluster (2)

PRIDE Cluster as a Public Data Mining Resource

http://www.ebi.ac.uk/pride/cluster

- Spectral libraries for 16 species.
- All clustering results, as well as specific subsets of interest, are available.
- Source code (open source) and Java API.


Reprocess

- Data are reprocessed with the intention of obtaining new knowledge or providing an updated view on the results.
- Reprocessing mainly serves the same purpose as the original experiment.
- For instance, a shotgun dataset can be reprocessed with a different algorithm or an updated sequence database.

Reprocessing repositories

These resources collect MS raw data and reprocess it using one given analysis pipeline and an up-to-date protein sequence database.

Main resources: GPMDB and PeptideAtlas (ISB, Seattle).

PeptideAtlas builds

Examples of builds:
- Human
- Human plasma
- Human urine
- Drosophila
- Mouse
- Mouse plasma
- Cow
- Yeast

Draft Human proteome papers published in 2014

Wilhelm et al., Nature, 2014
Kim et al., Nature, 2014

- Two independent groups claimed to have produced the first complete draft of the human proteome by MS.
- Some of their findings are controversial and need further validation, but they generated a lot of discussion and put proteomics in the spotlight.
- They used many different tissues.

Nature cover, 29 May 2014

Wilhelm et al., Nature, 2014:
- Around 60% of the data used for the analysis comes from previous experiments, most of them stored in proteomics repositories such as PRIDE/ProteomeXchange, PASSEL or MassIVE.
- They complement that data with exotic tissues.

OmicsDI: Portal for omics datasets

http://www.ebi.ac.uk/Tools/omicsdi/

Aims to integrate omics datasets (proteomics, transcriptomics, metabolomics and genomics at present).

- Proteomics: PRIDE, MassIVE, jPOST, PASSEL, GPMDB
- Transcriptomics: ArrayExpress, Expression Atlas
- Metabolomics: MetaboLights, Metabolomics Workbench, GNPS
- Genomics: EGA

Perez-Riverol et al., 2016, bioRxiv

PRIDE Proteomes provide an across-dataset and quality-filtered view on PRIDE Archive data. Good PSMs are assessed using the PRIDE Cluster approach, based on spectral clustering.

Conclusions

- Main characteristics of PRIDE and ProteomeXchange
- PX/PRIDE submission workflow for MS/MS data: PRIDE Inspector, PX submission tool
- PRIDE/ProteomeXchange has become the de facto standard for data submission and data availability in proteomics
- Reuse/reanalysis of proteomics data -> Many possible applications

Acknowledgements: The PRIDE Team

- Attila Csordas
- Tobias Ternent
- Gerhard Mayer (de.NBI)
- Johannes Griss
- Yasset Perez-Riverol
- Manuel Bernal-Llinares
- Andrew Jarnuczak

Former team members, especially Rui Wang, Florian Reisinger, Noemi del Toro, Jose A. Dianes & Henning Hermjakob

All data submitters!


PSI Spring Meeting 2017

Beijing Proteome Research Center, China
April 24-26, 2017
- April 23: 2nd PHOENIX Mini-Symposium on Frontiers of Proteomics
- April 27: Hiking the Great Wall

Focus topics:
- Quality control: qcML
- Proteogenomics formats
- proXI: proteomics eXpression Interface
- Privacy and Proteomics Data

PXD identifier; Hits / No. files = dataset downloads; Dataset title

- PXD000561; 46578 / 2383 = 20; A draft map of the human proteome
- PXD001587; 13435 / 140 = 96; DIA-Umpire: comprehensive computational framework for data independent acquisition proteomics
- PRD000066; 12748 / 4090 = 3; Quantitative Proteomics Analysis of the Secretory Pathway
- PXD000658; 4004 / 460 = 9; Global phosphoproteomic profiling reveals distinct signatures in B-cell non-Hodgkin
- PXD000149; 3781 / 598 = 6; The potato tuber mitochondrial proteome
- PXD000865; 12535 / 1368 = 9; Mass spectrometry based draft of the human proteome
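The "dataset downloads" column is the number of file hits divided by the number of files in the dataset, rounded to a whole number. A minimal sketch that recomputes the column from the slide's figures (the function and variable names are illustrative):

```python
# (identifier, file hits, number of files) as listed on the slide.
rows = [
    ("PXD000561", 46578, 2383),
    ("PXD001587", 13435, 140),
    ("PRD000066", 12748, 4090),
    ("PXD000658", 4004, 460),
    ("PXD000149", 3781, 598),
    ("PXD000865", 12535, 1368),
]


def estimated_downloads(hits, n_files):
    """Estimate whole-dataset downloads as file hits per file, rounded to the nearest integer."""
    return round(hits / n_files)


for identifier, hits, n_files in rows:
    print(f"{identifier}: {hits} / {n_files} = {estimated_downloads(hits, n_files)}")
```

This is only an estimate: it assumes that every download fetches all files of a dataset, which will not hold when users retrieve individual files.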
