ProteomeXchange Experience: PXD Identifiers and Release of Data on Acceptance, Uploading Large Data Sets

Dr. Juan Antonio Vizcaíno

PRIDE Group Coordinator

Proteomics Services Team

EMBL-EBI

Hinxton, Cambridge, UK


13th HUPO World Congress, Madrid, 5 October 2014

Overview

• The ProteomeXchange (PX) consortium

• How to submit and access data in PX via PRIDE

• How to access PX data

• Some HPP related things



ProteomeXchange Consortium

• Goal: Development of a framework to allow standard data submission and dissemination pipelines between the main existing proteomics repositories.

• Includes PeptideAtlas (ISB, Seattle), PRIDE (Cambridge, UK) and (very recently) MassIVE (UCSD, San Diego).

• Common identifier space (PXD identifiers)

• Two supported data workflows: MS/MS and SRM.

• Main objective: Make life easier for researchers

http://www.proteomexchange.org





ProteomeXchange data workflow

[Figure: ProteomeXchange data workflow (Vizcaíno et al., Nat Biotechnol, 2014). Diagram labels: submitted data (metadata / manuscript, raw data*, results); receiving repositories: PRIDE (MS/MS data), MassIVE (MS/MS data), PASSEL (SRM data); ProteomeCentral; connected resources: journals, UniProt/neXtProt, PeptideAtlas, GPMDB, other DBs; flows labelled researcher's results, reprocessed results, raw data*, metadata.]



MassIVE (UCSD)

http://proteomics.ucsd.edu/service/massive/

• Joined ProteomeXchange in June 2014

• Only partial submissions. A few datasets so far.






PASSEL: repository for SRM data

• Suitable for SRM assays

• Part of the PeptideAtlas set of resources.

http://www.peptideatlas.org/passel/

Farrah et al., Proteomics, 2012












Manuscript just out detailing the process: Ternent et al., Proteomics, 2014

http://www.proteomexchange.org/submission

Example dataset: PXD000764

- Title: “Discovery of new CSF biomarkers for meningitis in children”
- 12 runs: 4 controls and 8 infected samples
- Identification and quantification data



PX Data workflow for MS/MS data

1. Mass spectrometer output files: raw data (binary files) or peak list spectra in a standardized format (mzML, mzXML).

2. Result files:

a. Complete submissions: Result files can be converted to PRIDE XML or the mzIdentML data standard.

b. Partial submissions: For workflows not yet supported by PRIDE, search engine output files will be stored and provided in their original form.

3. Metadata: Sufficiently detailed description of sample origin, workflow, instrumentation, submitter.

4. Other files (optional):

a. QUANT: Quantification-related results
b. PEAK: Peak list files
c. GEL: Gel images
d. OTHER: Any other file type
e. FASTA
f. SP_LIBRARY

[Figure labels: Published, Raw Files, Other files]
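To make the file categories above concrete, here is a minimal Python sketch that sorts a list of file names into submission categories. The category names (RAW, RESULT, PEAK, FASTA, OTHER) come from the slide; the extension table, the function names and the rule used to call a submission "complete" are illustrative assumptions, not the behaviour of the PX submission tool.

```python
from pathlib import Path

# Hypothetical extension table: category names follow the slide,
# the extensions chosen for each category are assumptions.
EXTENSION_TO_TYPE = {
    ".raw": "RAW",
    ".mzml": "RAW",
    ".mzxml": "RAW",
    ".mzid": "RESULT",   # mzIdentML result file
    ".mgf": "PEAK",
    ".fasta": "FASTA",
}


def classify(file_name):
    """Return the assumed category for a file, or OTHER if the extension is unknown."""
    return EXTENSION_TO_TYPE.get(Path(file_name).suffix.lower(), "OTHER")


def group_by_type(file_names):
    """Group file names by category, preserving input order within each group."""
    groups = {}
    for name in file_names:
        groups.setdefault(classify(name), []).append(name)
    return groups


def submission_kind(groups):
    """Simplified rule (assumption): a standard-format result file makes it 'complete'."""
    return "complete" if groups.get("RESULT") else "partial"


if __name__ == "__main__":
    files = ["run1.raw", "run1.mzid", "db.fasta", "notes.txt"]
    groups = group_by_type(files)
    print(groups)
    print(submission_kind(groups))
```

In practice the distinction between complete and partial submissions depends on whether the result files can be converted to PRIDE XML or mzIdentML, which a file extension alone cannot guarantee.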



Complete vs Partial submissions: experimental metadata

[Screenshots: Complete, Partial]

General experimental metadata about the projects is similar. However, at the assay level, the information in partial submissions is less detailed.

Complete vs Partial submissions: processed results

[Screenshots: Complete, Partial]

For complete submissions, the spectra can be connected with the processed identification results, and both can be visualized.



Complete submissions using mzIdentML

[Diagram: search engine results + MS files → mzIdentML]

An increasing number of tools support export to mzIdentML 1.1:

- Mascot
- MSGF+
- Myrimatch and related tools from D. Tabb’s lab
- OpenMS
- PEAKS
- ProCon (ProteomeDiscoverer, Sequest)
- Scaffold
- TPP via the idConvert tool (ProteoWizard)
- ProteinPilot (planned by the end of 2014)
- Others: library for X!Tandem conversion, lab internal pipelines, …

Referenced spectral files need to be submitted as well (all open formats are supported).

Updated list: http://www.psidev.info/tools-implementing-mzidentml
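Because the spectra files referenced by an mzIdentML file must be submitted alongside it, a quick pre-submission check can help. The sketch below assumes the references are stored in the `location` attribute of `SpectraData` elements, as in mzIdentML 1.1; the function names are hypothetical, and this is not part of any PRIDE tool.

```python
import xml.etree.ElementTree as ET
from urllib.parse import unquote


def referenced_spectra_files(mzid_xml):
    """Return the file names referenced by SpectraData elements in an mzIdentML string."""
    root = ET.fromstring(mzid_xml)
    names = []
    for elem in root.iter():
        # Strip the XML namespace so the check does not depend on the schema version.
        if elem.tag.rsplit("}", 1)[-1] != "SpectraData":
            continue
        location = elem.get("location", "")
        # Keep only the last path component, for URL-style and Windows-style paths.
        names.append(unquote(location.replace("\\", "/").rsplit("/", 1)[-1]))
    return names


def missing_spectra_files(mzid_xml, submitted_names):
    """List referenced spectra files that are absent from the submitted file names."""
    submitted = {name.lower() for name in submitted_names}
    return [n for n in referenced_spectra_files(mzid_xml) if n.lower() not in submitted]


if __name__ == "__main__":
    example = (
        '<MzIdentML xmlns="http://psidev.info/psi/pi/mzIdentML/1.1">'
        "<DataCollection><Inputs>"
        '<SpectraData id="SD1" location="file:///C:/data/run1.mgf"/>'
        '<SpectraData id="SD2" location="run2.mzML"/>'
        "</Inputs></DataCollection></MzIdentML>"
    )
    print(missing_spectra_files(example, ["run1.mgf"]))
```

For files on disk, `ET.parse(path).getroot()` can replace `ET.fromstring`. The comparison is case-insensitive here, which is a design choice for illustration only.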



‘RESULT’ file generation

[Diagram: tools (Mascot, ProteinPilot, Scaffold, PEAKS, MSGF+, others) now offer native file export to mzIdentML; the final ‘RESULT’ file is the mzIdentML ‘RESULT’ file plus the spectra files.]



PRIDE Inspector 2.0

Available for complete submissions

PRIDE Inspector 2.0 supports:

- PRIDE XML
- mzIdentML + all types of spectra files
- mzML
- mzTab Ident (work in progress)

Wang et al., Nat Biotechnol, 2012

http://code.google.com/p/pride-toolsuite/wiki/PRIDEInspector





PX submission tool: data submission

• Captures the mappings between the different types of files.

• Adds the mandatory metadata annotation.

• Makes the file upload process straightforward for the submitter (it transfers all the files using Aspera or FTP).

• Command line alternative: some scripting is needed.

[Diagram labels: Published, Raw, Other files, PX submission tool]






Uploading large datasets: Aspera

- Aspera is the default file transfer protocol to PRIDE:
  - PX Submission tool
  - Command line
- Up to 50X faster than FTP

File transfer speed should not be a problem!
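To give the "up to 50X" figure some scale, the following Python sketch estimates transfer times for a 1.4 TB dataset (the size quoted for PXD000065 in this talk). The 20 Mbit/s FTP baseline is an assumed number chosen only for illustration, and the 50X factor is the upper bound quoted on the slide, so the result is a best case rather than a measurement.

```python
def transfer_hours(size_tb, rate_mbit_per_s):
    """Hours needed to move size_tb terabytes (decimal) at a sustained rate in Mbit/s."""
    size_bits = size_tb * 1e12 * 8
    seconds = size_bits / (rate_mbit_per_s * 1e6)
    return seconds / 3600


ASSUMED_FTP_RATE = 20.0   # Mbit/s, hypothetical baseline
MAX_SPEEDUP = 50          # "up to 50X" from the slide

ftp_hours = transfer_hours(1.4, ASSUMED_FTP_RATE)
aspera_hours = transfer_hours(1.4, ASSUMED_FTP_RATE * MAX_SPEEDUP)

if __name__ == "__main__":
    print(f"FTP at {ASSUMED_FTP_RATE} Mbit/s: {ftp_hours:.1f} h")
    print(f"Aspera, best case: {aspera_hours:.1f} h")
```

Under these assumptions the transfer drops from several days to a few hours; real throughput depends on the network path and the submitter's connection.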



ProteomeXchange: 1329 datasets up until October 2014

Origin:
271 USA
166 Germany
115 United Kingdom
73 Switzerland
70 China
68 Netherlands
67 France
55 Canada
44 Spain
42 Belgium
33 Sweden
31 Australia
31 Denmark
31 Japan
20 India
20 Norway
19 Taiwan
17 Ireland
16 Austria
14 Finland
14 Italy
12 Republic of Korea
11 Brazil
9 Russia
8 Israel
7 Singapore …

Type:
437 PRIDE complete
792 PRIDE partial
63 PeptideAtlas/PASSEL complete
14 MassIVE
23 reprocessed

Publicly accessible:
691 datasets, 52% of all
86% PRIDE
12% PASSEL
2% MassIVE

Data volume:
Total: ~55 TB
Number of all files: ~131,000
PXD000320-324: ~5 TB
PXD000065: ~1.4 TB

Top species studied by at least 10 datasets:
577 Homo sapiens
165 Mus musculus
56 Saccharomyces cerevisiae
53 Arabidopsis thaliana
29 Rattus norvegicus
22 Escherichia coli
17 Bos taurus
16 Mycobacterium tuberculosis
13 Oryza sativa
13 Drosophila melanogaster
13 Glycine max
~290 species in total

Datasets/year:
2012: 102
2013: 527
2014: 700
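The figures on this slide can be cross-checked with a few lines of Python. All numbers below are copied from the slide; only the variable names are introduced here.

```python
# Dataset counts by submission type, as listed on the slide.
by_type = {
    "PRIDE complete": 437,
    "PRIDE partial": 792,
    "PeptideAtlas/PASSEL complete": 63,
    "MassIVE": 14,
    "reprocessed": 23,
}

# Dataset counts by year, as listed on the slide.
by_year = {2012: 102, 2013: 527, 2014: 700}

total_datasets = 1329
public_datasets = 691

type_total = sum(by_type.values())
year_total = sum(by_year.values())
public_percent = round(100 * public_datasets / total_datasets)

if __name__ == "__main__":
    print("Sum by type:", type_total)
    print("Sum by year:", year_total)
    print("Public share (%):", public_percent)
```

Both breakdowns add up to the stated total of 1329 datasets, and 691 public datasets correspond to the quoted 52%.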



Public data release: when does it happen?

• When the author tells us to do it (authors can also do it themselves)

• When we find out that a dataset has been published

• We look for PXD identifiers in PubMed abstracts.

• If your PXD identifier is not in the abstract, the paper may have been published while the data are still private. Let us know!

• New web form on the PRIDE website to facilitate the process
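The abstract scan described above can be illustrated with a short regular expression. This sketch assumes that identifiers have the form `PXD` followed by six digits, as in the examples shown in this talk (e.g. PXD000764); the function name is hypothetical, and this is not the PRIDE team's actual pipeline.

```python
import re

# Assumed identifier shape: "PXD" followed by exactly six digits.
PXD_PATTERN = re.compile(r"\bPXD\d{6}\b")


def find_pxd_identifiers(text):
    """Return the distinct PXD identifiers found in text, in order of first appearance."""
    found = []
    for match in PXD_PATTERN.findall(text):
        if match not in found:
            found.append(match)
    return found


if __name__ == "__main__":
    abstract = (
        "The data have been deposited to the ProteomeXchange Consortium "
        "with the dataset identifier PXD000764."
    )
    print(find_pxd_identifiers(abstract))
```

An abstract that omits the identifier yields an empty list, which is exactly the case where the submitter needs to notify the repository.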






Partial submissions can be used to store other data types

• Everything can be stored, not only MS/MS data: very flexible mechanism to be able to capture all types of datasets

• PRIDE does not store SRM data (it goes to PASSEL)

• Top-down proteomics datasets.

• Mass spectrometry imaging datasets.

• Data-independent acquisition techniques: e.g. SWATH-MS datasets.



Imaging MS datasets: partial submissions

[Figure: panels C and D compare the image from the original publication [13] with the image reconstructed from ProteomeXchange data.]

From original publication [13]:

1. Thermo RAW data / UDP
2. Mirion Software (JLU)

Reconstructed ProteomeXchange data:

1. Thermo RAW data / UDP
2. Convert to imzML
3. Upload to PRIDE repository (European Bioinformatics Institute, Cambridge, UK). No file size limit!
4. Download from PRIDE
5. Display in MSiReader

- Vendor-independent data format
- Freely available software (open source)
- ‘Open data’: free to reuse
- Anybody can do this!








ProteomeCentral: Portal for all PX datasets

http://proteomecentral.proteomexchange.org/cgi/GetDataset
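As a small illustration of dataset lookup, the sketch below builds a ProteomeCentral link for a single dataset. It assumes that the `GetDataset` page accepts the identifier through an `ID` query parameter and that identifiers are `PXD` plus six digits; both are assumptions to verify against the live service, and `dataset_url` is a hypothetical helper.

```python
import re
from urllib.parse import urlencode

PROTEOMECENTRAL_BASE = "http://proteomecentral.proteomexchange.org/cgi/GetDataset"


def dataset_url(pxd_identifier):
    """Build a dataset URL; the 'ID' query parameter is an assumption."""
    if not re.fullmatch(r"PXD\d{6}", pxd_identifier):
        raise ValueError(f"Not a PXD identifier: {pxd_identifier!r}")
    return f"{PROTEOMECENTRAL_BASE}?{urlencode({'ID': pxd_identifier})}"


if __name__ == "__main__":
    print(dataset_url("PXD000561"))
```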





Get notified about new PX datasets

- Subscribe to the RSS Feed to receive information about the new datasets:

http://groups.google.com/group/proteomexchange/feed/rss_v2_0_msgs.xml
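For readers who prefer a script to a feed reader, this minimal sketch extracts item titles and links from an RSS 2.0 document using only the Python standard library. The sample feed in the code is made up for illustration and does not reproduce the real ProteomeXchange feed; the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Made-up sample in RSS 2.0 layout; not real feed content.
SAMPLE_FEED = """<rss version="2.0"><channel>
<title>Example announcements</title>
<item><title>Example dataset announcement</title><link>http://example.org/1</link></item>
<item><title>Another announcement</title><link>http://example.org/2</link></item>
</channel></rss>"""


def rss_items(rss_xml):
    """Return (title, link) pairs for every item element in an RSS 2.0 string."""
    root = ET.fromstring(rss_xml)
    return [
        (item.findtext("title", default=""), item.findtext("link", default=""))
        for item in root.iter("item")
    ]


if __name__ == "__main__":
    for title, link in rss_items(SAMPLE_FEED):
        print(title, link)
```

To use it on the live feed, the XML text could be downloaded first (for example with `urllib.request.urlopen`) and passed to `rss_items`.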













PX submission tool: HPP tags



HPP datasets are now tagged

The projects are now tagged and can be browsed as a group of datasets.

Tags for: HPP, C-HPP and B/D-HPP



HPP PX datasets: some numbers

Since January 2014, we have been capturing the PI information.

- 25 HPP datasets: 22 C-HPP and 3 B/D-HPP

- Countries represented in C-HPP:
  - 5 Spain
  - 4 South Korea
  - 3 Brazil, China

Only a small proportion of the datasets have been made publicly available, at least through ProteomeXchange.



Which are the most accessed datasets?

PXD Identifier | Hits | Dataset title | Publication
PXD000561 | 153512 | A draft map of the human proteome | Kim et al., Nature, 2014. PMID: 24870542
PXD000851 | 111587 | Membrane proteomic analysis of colorectal cancer tissue | Kume et al., MCP, 2014. PMID: 24687888
PXD000865 | 51639 | Mass spectrometry based draft of the human proteome | Wilhelm et al., Nature, 2014. PMID: 24870543



Which are the most accessed datasets?

[Chart; axis label: Total Numbers]



Conclusions

• ProteomeXchange is widely used.

• PRIDE contains most of the MS/MS datasets.

• It now has a new consortium member: MassIVE (UCSD).

• Around half of the datasets are already public.

• Different open source tools are available to facilitate the process.

• File transfer speed should not be a problem (Aspera support).

• Data deposition enables and promotes data reuse.

• ProteomeXchange is open to new members.



Acknowledgements

PRIDE Team: Attila Csordas, Rui Wang, Florian Reisinger, Jose A. Dianes, Tobias Ternent, Yasset Perez-Riverol, Noemi del Toro

Henning Hermjakob

EU FP7 grant number 260558

PeptideAtlas Team (ISB, Seattle): Eric Deutsch, Terry Farrah, Zhi Sun

Andrew R. Jones, Lennart Martens, Juan Pablo Albar, Martin Eisenacher, Gil Omenn, Nuno Bandeira

And many other PX partners and stakeholders



Questions?



Connecting different data types

How to connect different data types (genomics, metabolomics, etc.)?

It can be used for:

- ArrayExpress / GEO identifiers

- MetaboLights identifiers

- etc.



Pilot project started in the context of ELIXIR

[Diagram labels: PRIDE (EMBL-EBI), ELIXIR, EUDAT CDI, B2SAFE nodes at CSC, BILS, Site B and Site C]
