Data Submission Guidelines for the ProteomeXchange Consortium
PRIDE and ProteomeXchange: supporting the cultural change in proteomics public data deposition
Dr. Juan Antonio Vizcaíno
Proteomics Team Leader, EMBL-EBI
Hinxton, Cambridge, UK
Juan A. [email protected] Summer School 2016Dagstuhl, 27 September 2016
Data resources at EMBL-EBI
- Genes, genomes & variation: European Nucleotide Archive, European Variation Archive, European Genome-phenome Archive, Ensembl, Ensembl Genomes, GWAS Catalog, Metagenomics portal
- Gene, protein & metabolite expression: ArrayExpress, Expression Atlas, MetaboLights, PRIDE
- Protein sequences, families & motifs: InterPro, Pfam, UniProt
- Molecular structures: Protein Data Bank in Europe, Electron Microscopy Data Bank
- Chemical biology: ChEMBL, ChEBI
- Reactions, interactions & pathways: IntAct, Reactome, MetaboLights
- Systems: BioModels, Enzyme Portal, BioSamples
- Literature & ontologies: Europe PubMed Central, Gene Ontology, Experimental Factor Ontology
This slide shows the core resources at EMBL-EBI, to illustrate the range of data you can access through it.
Overview
- PRIDE and ProteomeXchange
- How to submit data to PRIDE: PRIDE tools
- How to access data in PRIDE Archive
- Some examples of public data reuse
What is a proteomics publication in 2016?
Proteomics studies generate potentially large amounts of data and results.
Ideally, a proteomics publication needs to:
- Summarize the results of the study
- Provide supporting information for reliability of any results reported
Information in a publication:
- Manuscript
- Supplementary material
- Associated data submitted to a public repository
PRIDE (PRoteomics IDEntifications) database
http://www.ebi.ac.uk/pride/archive
PRIDE stores mass spectrometry (MS)-based proteomics data:
- Peptide and protein expression data (identification and quantification)
- Post-translational modifications
- Mass spectra (raw data and peak lists)
- Technical and biological metadata
- Any other related information
Full support for tandem MS approaches.
Martens et al., Proteomics, 2005; Vizcaíno et al., NAR, 2016
Journal Submission Recommendations
Journal guidelines recommend submission to proteomics repositories:
- Proteomics (dataset briefs)
- JPR (HPP papers)
- Molecular and Cellular Proteomics
- Journals from the Nature group
- Journals from the PLOS group
Funding agencies are enforcing public deposition of data to maximize the value of the funds provided.
PRIDE: Source of MS proteomics data
PRIDE Archive already provides or will soon provide MS proteomics data to other EMBL-EBI resources such as UniProt, Ensembl and the Expression Atlas.
http://www.ebi.ac.uk/pride
PRIDE is working in two main directions:
- Develop submission and dissemination pipelines for MS proteomics data involving the main proteomics resources (the ProteomeXchange consortium).
- Integrate proteomics information (peptide/protein expression data) with other EBI resources such as Ensembl (genomics), the Expression Atlas (transcriptomics) and UniProt (protein sequence information).
Proteomics data is needed to have a more complete picture of biology.
ProteomeXchange: A Global, distributed proteomics database
http://www.proteomexchange.org
Member repositories:
- PRIDE (MS/MS data)
- MassIVE (MS/MS data)
- PASSEL (SRM data)
- jPOST (MS/MS data), new in 2016
Data handled: raw data (Raw), identification and quantification results (ID/Q) and metadata (Meta).
Mandatory raw data deposition since July 2015.
Goal: Development of a framework to allow standard data submission and dissemination pipelines between the main existing proteomics repositories.
Vizcaíno et al., Nat Biotechnol, 2014
ProteomeXchange data workflow
[Figure: researchers submit results, raw data and metadata to the receiving repositories. PRIDE, MassIVE and jPOST receive MS/MS data (as complete submissions) and any other workflow (mainly partial submissions); PASSEL receives SRM data. Datasets are announced through ProteomeCentral (metadata / manuscript, raw data, results) and linked to journals. The datasets are then used by UniProt/neXtProt, PeptideAtlas, GPMDB, proteomicsDB and other DBs, by research groups (reanalysis of datasets), by MassIVE (reprocessed results) and by OmicsDI (integration with other omics datasets).]
ProteomeXchange: 4,534 datasets up until 31st July, 2016
Type:
- 4067 PRIDE (~90%)
- 339 MassIVE
- 115 PeptideAtlas/PASSEL
- 13 jPOST
Publicly accessible: 2597 datasets, 57% of all
- 2334 PRIDE
- 135 MassIVE
- 115 PASSEL
- 13 jPOST
Datasets/year:
- 2012: 102
- 2013: 527
- 2014: 963
- 2015: 1758
- 2016 (till end of July): 1184
Countries with at least 100 datasets:
- 1105 USA
- 546 Germany
- 411 United Kingdom
- 356 China
- 229 France
- 188 Netherlands
- 178 Canada
- 150 Switzerland
- 125 Australia
- 123 Spain
- 123 Denmark
- 117 Japan
- 101 Sweden
Top species studied by at least 100 datasets:
- 2010 Homo sapiens
- 604 Mus musculus
- 191 Saccharomyces cerevisiae
- 140 Arabidopsis thaliana
- 127 Rattus norvegicus
936 reported taxa in total
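As a quick consistency check on the figures above, the following minimal Python sketch recomputes the totals and percentages from the per-repository and per-year counts. The numbers are copied from the slide; the variable and function names are illustrative only.

```python
# Dataset counts copied from the slide (ProteomeXchange, up to 31 July 2016).
by_repository = {"PRIDE": 4067, "MassIVE": 339, "PeptideAtlas/PASSEL": 115, "jPOST": 13}
public_by_repository = {"PRIDE": 2334, "MassIVE": 135, "PASSEL": 115, "jPOST": 13}
by_year = {2012: 102, 2013: 527, 2014: 963, 2015: 1758, 2016: 1184}


def percent(part, whole):
    """Return part/whole as a percentage rounded to the nearest integer."""
    return round(100 * part / whole)


total = sum(by_repository.values())
public_total = sum(public_by_repository.values())
yearly_total = sum(by_year.values())

pride_share = percent(by_repository["PRIDE"], total)
public_share = percent(public_total, total)

print(f"total datasets: {total} (sum over years: {yearly_total})")
print(f"PRIDE share: {pride_share}%")
print(f"publicly accessible: {public_total} ({public_share}%)")
```

The per-repository and per-year counts both add up to the quoted total, and the rounded shares match the "~90%" and "57%" shown on the slide.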
Overview
- PRIDE Archive (in the context of ProteomeXchange and the PSI standards)
- How to submit data to PRIDE: PRIDE tools
- How to access data in PRIDE Archive
- Some examples of public data reuse
Complete vs Partial submissions: processed results
For complete submissions, the spectra can be connected to the processed identification results (the results can be parsed), and both can be visualized.
Complete vs Partial submissions: experimental metadata
General experimental metadata about the projects is similar. However, at the assay level, the information in partial submissions is less detailed.
Partial submissions can be used to store other data types
Everything can be stored, not only MS/MS data: a very flexible mechanism that can capture all types of datasets.
- PRIDE does not store SRM data (it goes to PASSEL).
- Top-down proteomics datasets.
- Mass spectrometry imaging datasets.
- Data-independent acquisition techniques, e.g. SWATH-MS datasets.
How to perform a complete PX submission to PRIDE
- Decide between a complete/partial submission.
- File conversion/export to mzIdentML (or PRIDE XML).
- File check before submission (PRIDE Inspector).
- Experimental annotation and actual file submission (PX submission tool).
- Post-submission steps.
PX Data workflow for MS/MS data
- Mass spectrometer output files: raw data (binary files) or peak list spectra in a standardized format (mzML, mzXML).
- Result files:
  - Complete submissions: result files can be converted to the mzIdentML data standard (also PRIDE XML).
  - Partial submissions: for workflows not yet supported by PRIDE, search engine output files will be stored and provided in their original form.
- Metadata: sufficiently detailed description of sample origin, workflow, instrumentation, submitter.
- Other files: optional files (the list can be extended):
  - QUANT: Quantification related results
  - PEAK: Peak list files
  - GEL: Gel images
  - OTHER: Any other file type
  - FASTA
  - SP_LIBRARY
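The distinction between complete and partial submissions can be sketched as a check over a file manifest. This is only an illustration, not the logic of the PX submission tool: the category labels RAW, RESULT and SEARCH and the function name `submission_type` are assumptions made for this sketch, while the optional categories are the ones listed on the slide.

```python
from collections import Counter

# Optional file categories as listed on the slide.
OPTIONAL = {"QUANT", "PEAK", "GEL", "FASTA", "SP_LIBRARY", "OTHER"}
# Assumed labels for this sketch: raw files, standard result files
# (mzIdentML / PRIDE XML) and search engine output in its original form.
REQUIRED_KINDS = {"RAW", "RESULT", "SEARCH"}


def submission_type(files):
    """Classify a manifest of (filename, category) pairs as 'complete' or 'partial'."""
    counts = Counter(category for _, category in files)
    unknown = set(counts) - OPTIONAL - REQUIRED_KINDS
    if unknown:
        raise ValueError(f"unknown file categories: {sorted(unknown)}")
    if counts["RAW"] == 0:
        # Raw data deposition is mandatory.
        raise ValueError("raw data files are mandatory")
    if counts["RESULT"] > 0:
        return "complete"
    if counts["SEARCH"] > 0:
        return "partial"
    raise ValueError("no result or search engine output files found")


complete = [("run1.raw", "RAW"), ("run1.mzid", "RESULT"), ("run1.mgf", "PEAK")]
partial = [("run1.raw", "RAW"), ("run1.dat", "SEARCH")]
print(submission_type(complete), submission_type(partial))
```

A real submission involves further checks (for example, that spectra files referenced by an mzIdentML file are included), which this sketch leaves out.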
PRIDE Components: Submission Process (tools)
- Step 1: PRIDE Converter 2 (file conversion/export to mzIdentML or PRIDE XML)
- Step 2: PRIDE Inspector
- Step 3: PX Submission Tool
RESULT file generation
- Final RESULT file: mzIdentML
- Now: native file export to mzIdentML from Mascot, ProteinPilot, Scaffold, PEAKS, MSGF+, PLGS and others
- Spectra files (mzML, mzXML, mzData, mgf, pkl, ms2, dta, apl)
Complete submissions
Search engine results + MS files. Search engines and tools that export to mzIdentML:
- Mascot
- MSGF+
- MyriMatch and related tools from D. Tabb's lab
- OpenMS
- PEAKS
- PeptideShaker
- ProCon (ProteomeDiscoverer, Sequest)
- Scaffold
- TPP via the idConvert tool (ProteoWizard)
- ProteinPilot (from version 5.0)
- X!Tandem native conversion (Beta, PILEDRIVER)
- Others: library for X!Tandem conversion, lab internal pipelines, Crux
An increasing number of tools support export to mzIdentML 1.1.
Referenced spectral files need to be submitted as well (all open formats are supported).
Updated list: http://www.psidev.info/tools-implementing-mzIdentML#.
Complete submissions: which tools are missing
- MaxQuant: Export to mzTab (work in progress)
- Proteome Discoverer (Thermo): Work in progress
An increasing number of tools support export to mzIdentML 1.1.
Updated list: http://www.psidev.info/tools-implementing-mzIdentML#.
Submission process, step 2: PRIDE Inspector
PRIDE Inspector Toolsuite
PRIDE Inspector: a standalone tool to enable visualisation and validation of MS data.
- Built on top of ms-data-core-api: open source algorithms and libraries for computational proteomics.
- Supported file formats: mzIdentML, mzML, mzTab (PSI standards), and PRIDE XML.
- Broad functionality.
https://github.com/PRIDE-Utilities/ms-data-core-api
https://github.com/PRIDE-Toolsuite/pride-inspector
Wang et al., Nat. Biotechnology, 2012; Perez-Riverol et al., Bioinformatics, 2015; Perez-Riverol et al., MCP, 2016
PRIDE Inspector Functionality
- Summary and QC charts (Delta m/z, precursor charges, etc.)
- Spectra view (fragmentation table, ion series annotation): peptide spectra annotation and visualisation
- Protein inference algorithm and protein groups visualisation
- Protein view containing protein inference information
- Quantification view
- Multiple export options (.mgf, protein/peptide tables, mzTab file)
- Direct access to PRIDE datasets
Submission process, step 3: PX Submission Tool
PX submission tool
http://www.proteomexchange.org/submission
- Captures the mappings between the different types of files.
- Makes the file upload process straightforward for the submitter (it transfers all the files using Aspera or FTP).
- Command line alternative: using the Aspera file transfer protocol.
PX submission tool: screenshots
Fast file transfer with Aspera
- Aspera is the default file transfer protocol to PRIDE:
  - PX Submission tool
  - Command line
- Up to 50X faster than FTP
File transfer speed should not be a problem!
Manuscript published detailing the process
Ternent et al., Proteomics, 2014
http://www.proteomexchange.org/submission
Example dataset: PXD000764
- Title: Discovery of new CSF biomarkers for meningitis in children
- 12 runs: 4 controls and 8 infected samples
- Identification and quantification data
Public data release: when does it happen?
- When the author tells us to do it (the authors can do it by themselves).
- When we find out that a dataset has been published. We look for PXD identifiers in PubMed abstracts.
- If your PXD identifier is not in the abstract, a paper may have been published while the data is still private. Let us know!
- New web form on the PRIDE website to facilitate the process.
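Looking for PXD identifiers in abstract text can be sketched with a regular expression. This is a minimal illustration, not the PRIDE team's actual pipeline: the pattern assumes identifiers of the form "PXD" followed by six digits, as in the examples in these slides, and the function name and sample abstract are made up for the example.

```python
import re

# Assumed identifier shape: "PXD" followed by exactly six digits (e.g. PXD000764).
PXD_PATTERN = re.compile(r"\bPXD\d{6}\b")


def find_pxd_identifiers(text):
    """Return the distinct PXD identifiers found in text, in order of first appearance."""
    found = []
    for identifier in PXD_PATTERN.findall(text):
        if identifier not in found:
            found.append(identifier)
    return found


# Made-up abstract text for illustration.
abstract = (
    "The data have been deposited to ProteomeXchange with identifier "
    "PXD000764 (see also PXD000561 and PXD000764)."
)
identifiers = find_pxd_identifiers(abstract)
print(identifiers)
```

An abstract that mentions the dataset only in the full text, or not at all, returns an empty list, which is the case where the submitter needs to notify PRIDE.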
Overview
- PRIDE Archive (in the context of ProteomeXchange and the PSI standards)
- How to submit data to PRIDE: PRIDE tools
- How to access data in PRIDE Archive
- PRIDE Cluster and PRIDE Proteomes
ProteomeCentral: Centralised portal for all PX datasets
http://proteomecentral.proteomexchange.org/cgi/GetDataset
RSS and Twitter feeds for public datasets
http://groups.google.com/group/proteomexchange/feed/rss_v2_0_msgs.xml
@proteomexchange
Ways to access data in PRIDE Archive
- PRIDE web interface
- File repository
- REST web service
- PRIDE Inspector tool
CompOmics Open Source Analysis Pipeline
https://github.com/compomics/searchgui
https://github.com/compomics/peptide-shaker
Vaudel M, Barsnes H, Berven FS, Sickmann A, Martens L: Proteomics 2011;11(5):996-9.
Vaudel M, Burkhart J, Zahedi RP, Berven FS, Sickmann A, Martens L, Barsnes H: Nature Biotechnology 2015; 33(1):22-24.
Reshake PRIDE data!
Find the desired PRIDE project, inspect the project details and start re-analyzing the data!
Overview
- PRIDE Archive (in the context of ProteomeXchange and the PSI standards)
- How to submit data to PRIDE: PRIDE tools
- How to access data in PRIDE Archive
- Some examples of public data reuse
Datasets are being reused more and more.
Data download volume for PRIDE in 2015: ~ 200 TB
Vaudel et al., Proteomics, 2016
Challenges for data reuse in proteomics
- Insufficient technical and biological metadata.
- Large computational infrastructure may be needed (e.g. when analysing many datasets together).
- Shortage of expertise (people).
- Lack of standardisation in the field.
Data sharing in Proteomics
PRIDE Cluster
- Provide an aggregated peptide-centric view of PRIDE Archive.
- Hypothesis: the same peptide will generate similar MS/MS spectra across experiments.
- New spectral clustering algorithm to reliably group spectra coming from the same peptide.
- Infer reliable identifications by comparing submitted identifications of spectra within a cluster.
- After clustering, a representative spectrum is built for all peptides consistently identified across different datasets.
Griss et al., Nat. Methods, 2013; Griss et al., Nat. Methods, 2016
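The idea of inferring a reliable identification by comparing the submitted identifications within a cluster can be illustrated with a simple majority vote. This is a simplified stand-in, not the published PRIDE Cluster algorithm: the function name, the `min_ratio` threshold and the peptide sequences are hypothetical.

```python
from collections import Counter


def consensus_identification(peptides, min_ratio=0.7):
    """Return (peptide, ratio) for the most frequent identification in a cluster.

    peptides: submitted peptide identifications, one per spectrum in the cluster.
    min_ratio: hypothetical threshold; below it no consensus is reported (None).
    """
    if not peptides:
        raise ValueError("cluster contains no identified spectra")
    peptide, count = Counter(peptides).most_common(1)[0]
    ratio = count / len(peptides)
    return (peptide if ratio >= min_ratio else None, ratio)


# A cluster where every PSM carries the same peptide identification.
perfect_cluster = ["PEPTIDEK"] * 880
# A cluster with conflicting identifications (made-up sequences).
mixed_cluster = ["AAK"] * 3 + ["AAR"] * 2

print(consensus_identification(perfect_cluster))
print(consensus_identification(mixed_cluster))
```

A cluster in which all spectra agree has a ratio of 1.0, while a cluster with conflicting identifications falls below the threshold and yields no consensus in this sketch.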
Examples: one perfect cluster
- 880 PSMs give the same peptide ID
- 4 species
- 28 datasets
- Same instruments
Examples: one perfect cluster (2)
PRIDE Cluster as a Public Data Mining Resource
http://www.ebi.ac.uk/pride/cluster
- Spectral libraries for 16 species.
- All clustering results, as well as specific subsets of interest, are available.
- Source code (open source) and Java API.
Data sharing in Proteomics
Reprocess
- Data are reprocessed with the intention of obtaining new knowledge or providing an updated view of the results.
- Reprocessing mainly serves the same purpose as the original experiment.
- For instance, a shotgun dataset can be reprocessed with a different algorithm or an updated sequence database.
Reprocessing repositories
These resources collect MS raw data and reprocess it using one given analysis pipeline and an up-to-date protein sequence database.
Main resources: GPMDB and PeptideAtlas (ISB, Seattle).
PeptideAtlas builds
Examples of builds:
- Human
- Human plasma
- Human urine
- Drosophila
- Mouse
- Mouse plasma
- Cow
- Yeast
Draft Human proteome papers published in 2014
Wilhelm et al., Nature, 2014; Kim et al., Nature, 2014 (Nature cover, 29 May 2014)
- Two independent groups claimed to have produced the first complete draft of the human proteome by MS.
- Some of their findings are controversial and need further validation, but they generated a lot of discussion and put proteomics in the spotlight.
- They used many different tissues.
Wilhelm et al., Nature, 2014:
- Around 60% of the data used for the analysis comes from previous experiments, most of them stored in proteomics repositories such as PRIDE/ProteomeXchange, PASSEL or MassIVE.
- They complemented that data with exotic tissues.
OmicsDI: Portal for omics datasets
http://www.ebi.ac.uk/Tools/omicsdi/
Aims to integrate omics datasets (proteomics, transcriptomics, metabolomics and genomics at present):
- Proteomics: PRIDE, MassIVE, jPOST, PASSEL, GPMDB
- Transcriptomics: ArrayExpress, Expression Atlas
- Metabolomics: MetaboLights, Metabolomics Workbench, GNPS
- Genomics: EGA
Perez-Riverol et al., 2016, bioRxiv
PRIDE Proteomes provide an across-dataset and quality-filtered view on PRIDE Archive data. Good PSMs are assessed using the PRIDE Cluster approach, based on spectral clustering.
Conclusions
- Main characteristics of PRIDE and ProteomeXchange
- PX/PRIDE submission workflow for MS/MS data: PRIDE Inspector, PX submission tool
- PRIDE/ProteomeXchange has become the de facto standard for data submission and data availability in proteomics
- Reuse/reanalysis of proteomics data -> many possible applications
Acknowledgements: People
The PRIDE Team:
- Attila Csordas
- Tobias Ternent
- Gerhard Mayer (de.NBI)
- Johannes Griss
- Yasset Perez-Riverol
- Manuel Bernal-Llinares
- Andrew Jarnuczak
Former team members, especially Rui Wang, Florian Reisinger, Noemi del Toro, Jose A. Dianes & Henning Hermjakob
All data submitters!
PSI Spring Meeting 2017
Beijing Proteome Research Center, China
April 24-26, 2017
April 23: 2nd PHOENIX Mini-Symposium on Frontiers of Proteomics
April 27: Hiking the Great Wall
Focus topics:
- Quality control: qcML
- Proteogenomics formats
- proXI: proteomics eXpression Interface
- Privacy and Proteomics Data
PXD identifier | Hits / No. of files = dataset downloads | Dataset title
PXD000561 | 46578 / 2383 = 20 | A draft map of the human proteome
PXD001587 | 13435 / 140 = 96 | DIA-Umpire: comprehensive computational framework for data independent acquisition proteomics
PRD000066 | 12748 / 4090 = 3 | Quantitative Proteomics Analysis of the Secretory Pathway
PXD000658 | 4004 / 460 = 9 | Global phosphoproteomic profiling reveals distinct signatures in B-cell non-Hodgkin
PXD000149 | 3781 / 598 = 6 | The potato tuber mitochondrial proteome
PXD000865 | 12535 / 1368 = 9 | Mass spectrometry based draft of the human proteome
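The "dataset downloads" column is the number of file hits divided by the number of files in the dataset, rounded to the nearest integer. A minimal Python sketch of that arithmetic, using the figures from the table (variable names are illustrative):

```python
# (identifier, file hits, number of files) copied from the table above.
rows = [
    ("PXD000561", 46578, 2383),
    ("PXD001587", 13435, 140),
    ("PRD000066", 12748, 4090),
    ("PXD000658", 4004, 460),
    ("PXD000149", 3781, 598),
    ("PXD000865", 12535, 1368),
]

# Estimated whole-dataset downloads: hits per file, rounded to the nearest integer.
dataset_downloads = {identifier: round(hits / files) for identifier, hits, files in rows}

for identifier, downloads in dataset_downloads.items():
    print(f"{identifier}: ~{downloads} dataset downloads")
```

Normalising by file count matters here: PRD000066 has almost as many hits as PXD001587 but roughly thirty times as many files, so it corresponds to far fewer complete dataset downloads.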