ProteomeXchange Experience: PXD Identifiers and Release of Data on Acceptance, Uploading Large Data...

ProteomeXchange Experience: PXD Identifiers and Release of Data on Acceptance, Uploading Large Data Sets. Dr. Juan Antonio Vizcaíno, PRIDE Group Coordinator, Proteomics Services Team, EMBL-EBI, Hinxton, Cambridge, UK

description

Talk I gave in the Human Proteome Project session during HUPO 2014, devoted to ProteomeXchange. I summarized the updates from the last year.

Transcript of ProteomeXchange Experience: PXD Identifiers and Release of Data on Acceptance, Uploading Large Data...

Page 1: ProteomeXchange Experience: PXD Identifiers and Release of Data on Acceptance, Uploading Large Data Sets

ProteomeXchange Experience: PXD Identifiers and Release of Data on Acceptance, Uploading Large Data Sets

Dr. Juan Antonio Vizcaíno

PRIDE Group Coordinator

Proteomics Services Team

EMBL-EBI

Hinxton, Cambridge, UK

Page 2: ProteomeXchange Experience: PXD Identifiers and Release of Data on Acceptance, Uploading Large Data Sets

Juan A. Vizcaíno

13th HUPO World Congress, Madrid, 5 October 2014

Overview

• The ProteomeXchange (PX) consortium

• How to submit and access data in PX via PRIDE

• How to access PX data

• Some HPP related things

Page 3: ProteomeXchange Experience: PXD Identifiers and Release of Data on Acceptance, Uploading Large Data Sets

ProteomeXchange Consortium

• Goal: Development of a framework to allow standard data submission and dissemination pipelines between the main existing proteomics repositories.

• Includes PeptideAtlas (ISB, Seattle), PRIDE (Cambridge, UK) and (very recently) MassIVE (UCSD, San Diego).

• Common identifier space (PXD identifiers)

• Two supported data workflows: MS/MS and SRM.

• Main objective: Make life easier for researchers

http://www.proteomexchange.org

Page 4: ProteomeXchange Experience: PXD Identifiers and Release of Data on Acceptance, Uploading Large Data Sets

ProteomeXchange data workflow (Vizcaíno et al., Nat Biotechnol, 2014)

[Diagram labels: researcher’s results, raw data* and metadata; receiving repositories: PRIDE (MS/MS data), MassIVE (MS/MS data), PASSEL (SRM data); ProteomeCentral (metadata / manuscript); journals, UniProt/neXtProt, PeptideAtlas, GPMDB, other DBs; reprocessed results]

Page 5: ProteomeXchange Experience: PXD Identifiers and Release of Data on Acceptance, Uploading Large Data Sets

MassIVE (UCSD)

http://proteomics.ucsd.edu/service/massive/

• Joined ProteomeXchange in June 2014

• Only partial submissions. A few datasets so far.

Page 6: ProteomeXchange Experience: PXD Identifiers and Release of Data on Acceptance, Uploading Large Data Sets

PASSEL: repository for SRM data

• Suitable for SRM assays

• Part of the PeptideAtlas set of resources.

http://www.peptideatlas.org/passel/

Farrah et al., Proteomics, 2012

Page 9: ProteomeXchange Experience: PXD Identifiers and Release of Data on Acceptance, Uploading Large Data Sets

Manuscript just out detailing the process

Ternent et al., Proteomics, 2014

http://www.proteomexchange.org/submission

Example dataset: PXD000764

- Title: “Discovery of new CSF biomarkers for meningitis in children”
- 12 runs: 4 controls and 8 infected samples
- Identification and quantification data

Page 10: ProteomeXchange Experience: PXD Identifiers and Release of Data on Acceptance, Uploading Large Data Sets

PX Data workflow for MS/MS data

1. Mass spectrometer output files: raw data (binary files) or peak list spectra in a standardized format (mzML, mzXML).

2. Result files:

a. Complete submissions: Result files can be converted to PRIDE XML or the mzIdentML data standard.

b. Partial submissions: For workflows not yet supported by PRIDE, search engine output files will be stored and provided in their original form.

3. Metadata: Sufficiently detailed description of sample origin, workflow, instrumentation, submitter.

4. Other files (optional):

a. QUANT: Quantification related results

b. PEAK: Peak list files

c. GEL: Gel images

d. OTHER: Any other file type

e. FASTA

f. SP_LIBRARY

[Figure labels: Published, Raw files, Other files]
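The four kinds of content listed on this slide can be modelled as tagged files. The sketch below is only an illustration, not the logic of the PX submission tool: the `RAW` and `SEARCH` tag names and the rule used to label a submission are assumptions made for the example, while `RESULT`, `QUANT`, `PEAK`, `GEL`, `FASTA`, `SP_LIBRARY` and `OTHER` are the tags named in the talk.

```python
from enum import Enum


class FileType(Enum):
    RAW = "RAW"                # mass spectrometer output (tag name assumed)
    RESULT = "RESULT"          # PRIDE XML or mzIdentML result file
    SEARCH = "SEARCH"          # original search engine output (tag name assumed)
    QUANT = "QUANT"            # quantification related results
    PEAK = "PEAK"              # peak list files
    GEL = "GEL"                # gel images
    FASTA = "FASTA"
    SP_LIBRARY = "SP_LIBRARY"
    OTHER = "OTHER"            # any other file type


def submission_kind(file_types):
    """Label a collection of file types as 'complete', 'partial' or 'invalid'.

    Assumed rule for this sketch: mass spectrometer output is always
    required; RESULT files make the submission complete; otherwise
    search engine output makes it partial.
    """
    types = set(file_types)
    if FileType.RAW not in types:
        return "invalid"
    if FileType.RESULT in types:
        return "complete"
    if FileType.SEARCH in types:
        return "partial"
    return "invalid"
```

For instance, `submission_kind([FileType.RAW, FileType.SEARCH])` is labelled as a partial submission under these assumptions.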

Page 11: ProteomeXchange Experience: PXD Identifiers and Release of Data on Acceptance, Uploading Large Data Sets

Complete vs Partial submissions: experimental metadata

[Screenshots: Complete, Partial]

General experimental metadata about the projects is similar. However, at the assay level, the information in partial submissions is less detailed.

Page 12: ProteomeXchange Experience: PXD Identifiers and Release of Data on Acceptance, Uploading Large Data Sets

Complete vs Partial submissions: processed results

For complete submissions, it is possible to connect the spectra with the processed identification results, and they can be visualized.

[Screenshots: Complete, Partial]

Page 13: ProteomeXchange Experience: PXD Identifiers and Release of Data on Acceptance, Uploading Large Data Sets

Complete submissions using mzIdentML

[Diagram labels: search engine results + MS files; search engines; mzIdentML]

An increasing number of tools support export to mzIdentML 1.1:

- Mascot
- MSGF+
- MyriMatch and related tools from D. Tabb’s lab
- OpenMS
- PEAKS
- ProCon (ProteomeDiscoverer, Sequest)
- Scaffold
- TPP via the idConvert tool (ProteoWizard)
- ProteinPilot (planned by the end of 2014)
- Others: library for X!Tandem conversion, lab internal pipelines, …

Referenced spectral files need to be submitted as well (all open formats are supported).

Updated list: http://www.psidev.info/tools-implementing-mzIdentML#.
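Because the referenced spectral files must be submitted together with the mzIdentML file, it can help to list them before starting a submission. The following is a minimal sketch using only the Python standard library; it assumes the references are held in the `location` attribute of `SpectraData` elements, as in mzIdentML 1.1. The sample document and its file names are hypothetical.

```python
import xml.etree.ElementTree as ET


def referenced_spectra_files(mzid_xml):
    """Return the `location` of every SpectraData element in an mzIdentML string."""
    root = ET.fromstring(mzid_xml)
    # "{*}" matches any namespace (Python 3.8+), so the schema version
    # embedded in the namespace URI does not need to be hard-coded.
    return [el.get("location") for el in root.findall(".//{*}SpectraData")]


# Hypothetical, heavily trimmed mzIdentML fragment for illustration only.
sample = """<MzIdentML xmlns="http://psidev.info/psi/pi/mzIdentML/1.1">
  <DataCollection>
    <Inputs>
      <SpectraData id="SD_1" location="file:///data/run01.mgf"/>
      <SpectraData id="SD_2" location="file:///data/run02.mgf"/>
    </Inputs>
  </DataCollection>
</MzIdentML>"""

spectra = referenced_spectra_files(sample)
```

A real file would be read from disk and would contain many more elements; only the lookup of the spectra references is shown here.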

Page 14: ProteomeXchange Experience: PXD Identifiers and Release of Data on Acceptance, Uploading Large Data Sets

[Diagram columns: Tools | ‘RESULT’ file generation | Final ‘RESULT’ file]

[Diagram labels: Mascot, ProteinPilot, Scaffold, PEAKS, MSGF+, others; now: native file export; mzIdentML ‘RESULT’; spectra files]

Page 15: ProteomeXchange Experience: PXD Identifiers and Release of Data on Acceptance, Uploading Large Data Sets

PRIDE Inspector 2.0

Available for complete submissions

Wang et al., Nat. Biotechnology, 2012

PRIDE Inspector 2.0 supports:

- PRIDE XML
- mzIdentML + all types of spectra files
- mzML
- mzTab Ident (work in progress)

http://code.google.com/p/pride-toolsuite/wiki/PRIDEInspector

Page 16: ProteomeXchange Experience: PXD Identifiers and Release of Data on Acceptance, Uploading Large Data Sets

PX submission tool: data submission

• Capture the mappings between the different types of files.

• Add the mandatory metadata annotation.

• Make the file upload process straightforward to the submitter (it transfers all the files using Aspera or FTP).

• Command line alternative: some scripting is needed.

[Diagram labels: Published, Raw, Other files; PX submission tool]

http://www.proteomexchange.org/submission
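The idea of capturing mappings between file types can be illustrated with a small consistency check of the kind a script might run before upload. This is a hypothetical sketch: the dictionary layout, the function name and the file names are assumptions for the example and do not reflect the internal format used by the PX submission tool.

```python
def unmapped_results(mappings, uploaded):
    """Return result files whose mapping is empty or points to missing files.

    mappings: dict of result file name -> list of related file names
    uploaded: iterable with every file name included in the submission
    """
    uploaded = set(uploaded)
    problems = {}
    for result, related in mappings.items():
        missing = [name for name in related if name not in uploaded]
        if not related or missing:
            problems[result] = missing
    return problems


# Hypothetical file names for illustration only.
mappings = {
    "a.mzid": ["a.raw", "a.mgf"],
    "b.mzid": ["b.raw"],
    "c.mzid": [],
}
uploaded = ["a.mzid", "a.raw", "a.mgf", "b.mzid", "c.mzid"]
problems = unmapped_results(mappings, uploaded)
```

Here `b.mzid` is flagged because `b.raw` was not included, and `c.mzid` because it has no mapping at all.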

Page 17: ProteomeXchange Experience: PXD Identifiers and Release of Data on Acceptance, Uploading Large Data Sets

Uploading large datasets: Aspera

- Aspera is the default file transfer protocol to PRIDE:
  - PX Submission tool
  - Command line

- Up to 50X faster than FTP

File transfer speed should not be a problem!!
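To get a feel for what a 50X speed-up means for a multi-terabyte dataset, a rough back-of-the-envelope calculation is enough. In the sketch below the 20 Mbit/s baseline rate is purely hypothetical, and the 50X factor is the upper bound quoted on the slide rather than a measured value.

```python
def transfer_hours(size_tb, rate_mbit_s):
    """Hours needed to move size_tb (decimal terabytes) at rate_mbit_s megabits per second."""
    bits = size_tb * 1e12 * 8              # terabytes -> bits
    seconds = bits / (rate_mbit_s * 1e6)   # megabits/s -> bits/s
    return seconds / 3600


baseline_rate = 20                          # Mbit/s, hypothetical FTP throughput
slow = transfer_hours(5, baseline_rate)       # about 556 hours for 5 TB
fast = transfer_hours(5, baseline_rate * 50)  # about 11 hours at 50X the rate
```

Under these assumptions a 5 TB dataset goes from weeks of transfer time to well under a day.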

Page 18: ProteomeXchange Experience: PXD Identifiers and Release of Data on Acceptance, Uploading Large Data Sets

ProteomeXchange: 1329 datasets up until October 2014

Origin: 271 USA, 166 Germany, 115 United Kingdom, 73 Switzerland, 70 China, 68 Netherlands, 67 France, 55 Canada, 44 Spain, 42 Belgium, 33 Sweden, 31 Australia, 31 Denmark, 31 Japan, 20 India, 20 Norway, 19 Taiwan, 17 Ireland, 16 Austria, 14 Finland, 14 Italy, 12 Republic of Korea, 11 Brazil, 9 Russia, 8 Israel, 7 Singapore, …

Type:
- 437 PRIDE complete
- 792 PRIDE partial
- 63 PeptideAtlas/PASSEL complete
- 14 MassIVE
- 23 reprocessed

Publicly accessible: 691 datasets, 52% of all
- 86% PRIDE
- 12% PASSEL
- 2% MassIVE

Data volume:
- Total: ~55 TB
- Number of all files: ~131,000
- PXD000320-324: ~5 TB
- PXD000065: ~1.4 TB

Top species studied by at least 10 datasets (~290 species in total):
- 577 Homo sapiens
- 165 Mus musculus
- 56 Saccharomyces cerevisiae
- 53 Arabidopsis thaliana
- 29 Rattus norvegicus
- 22 Escherichia coli
- 17 Bos taurus
- 16 Mycobacterium tuberculosis
- 13 Oryza sativa
- 13 Drosophila melanogaster
- 13 Glycine max

Datasets/year:
- 2012: 102
- 2013: 527
- 2014: 700
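The figures on this slide are internally consistent, which can be checked with a few lines of arithmetic. The snippet below uses only numbers taken from the slide; the variable names are chosen for the example.

```python
by_type = {
    "PRIDE complete": 437,
    "PRIDE partial": 792,
    "PeptideAtlas/PASSEL complete": 63,
    "MassIVE": 14,
    "reprocessed": 23,
}
by_year = {2012: 102, 2013: 527, 2014: 700}
total = 1329
public = 691

# Both breakdowns add up to the total number of datasets.
assert sum(by_type.values()) == total
assert sum(by_year.values()) == total

# 691 of 1329 datasets is the 52% quoted as publicly accessible.
public_share = round(100 * public / total)
```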

Page 19: ProteomeXchange Experience: PXD Identifiers and Release of Data on Acceptance, Uploading Large Data Sets

Public data release: when does it happen?

• When the author tells us to do it (the authors can do it by themselves)

• When we find out that a dataset has been published

• We look for PXD identifiers in PubMed abstracts.

• If your PXD identifier is not in the abstract, a paper may have been published and the data is still private. Let us know!

• New web form in the PRIDE web to facilitate the process
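Looking for PXD identifiers in abstract text is essentially a pattern-matching task. The sketch below is not the PRIDE team's actual pipeline; it assumes identifiers consist of "PXD" followed by six digits, which matches every identifier shown in this talk, and the example abstract is invented.

```python
import re

# Assumption: "PXD" followed by exactly six digits.
PXD_PATTERN = re.compile(r"\bPXD\d{6}\b")


def find_pxd_identifiers(text):
    """Return the unique PXD identifiers in text, in order of first appearance."""
    seen = []
    for match in PXD_PATTERN.findall(text):
        if match not in seen:
            seen.append(match)
    return seen


# Invented abstract text for illustration only.
abstract = (
    "Data are available via ProteomeXchange with identifier PXD000764 "
    "(see also PXD000561 and PXD000764)."
)
found = find_pxd_identifiers(abstract)
```

An abstract that never mentions the identifier returns an empty list, which is exactly the case where the submitter needs to notify the repository.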

Page 21: ProteomeXchange Experience: PXD Identifiers and Release of Data on Acceptance, Uploading Large Data Sets

Partial submissions can be used to store other data types

• Everything can be stored, not only MS/MS data: a very flexible mechanism that can capture all types of datasets

• PRIDE does not store SRM data (it goes to PASSEL)

• Top down proteomics datasets.

• Mass Spectrometry Imaging datasets.

• Data independent acquisition techniques: e.g. SWATH-MS datasets.

Page 22: ProteomeXchange Experience: PXD Identifiers and Release of Data on Acceptance, Uploading Large Data Sets

Imaging MS datasets: partial submissions

From original publication [13]:

1. Thermo RAW data / UDP
2. Mirion Software (JLU)

Reconstructed ProteomeXchange data:

1. Thermo RAW data / UDP
2. Convert to imzML
3. Upload to PRIDE repository (EBI, Cambridge, UK). No file size limit!
4. Download from PRIDE
5. Display in MSiReader

- Vendor-independent data format
- Freely available software (open source)
- ‘Open data’: free to reuse
- Anybody can do this!

Page 25: ProteomeXchange Experience: PXD Identifiers and Release of Data on Acceptance, Uploading Large Data Sets

ProteomeCentral: Portal for all PX datasets

http://proteomecentral.proteomexchange.org/cgi/GetDataset
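A link to a single dataset page can be built from the portal address above. This is a small sketch under one assumption: that the dataset is selected with a query parameter named `ID`. That parameter name is not stated in the talk and should be checked against the live service.

```python
from urllib.parse import urlencode

BASE_URL = "http://proteomecentral.proteomexchange.org/cgi/GetDataset"


def dataset_url(pxd_id):
    """Build a ProteomeCentral link for one dataset (the `ID` parameter is an assumption)."""
    return BASE_URL + "?" + urlencode({"ID": pxd_id})


url = dataset_url("PXD000764")
```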

Page 26: ProteomeXchange Experience: PXD Identifiers and Release of Data on Acceptance, Uploading Large Data Sets

Get notified about new PX datasets

- Subscribe to the RSS Feed to receive information about the new datasets:

http://groups.google.com/group/proteomexchange/feed/rss_v2_0_msgs.xml

[Diagram labels: ProteomeCentral, Researchers]
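Announcements delivered as an RSS feed can also be read programmatically. The sketch below parses an RSS 2.0 document with the Python standard library and returns the item titles; it works on a string, so fetching the feed is left out, and the sample feed content is invented for illustration.

```python
import xml.etree.ElementTree as ET


def feed_titles(rss_xml):
    """Return the title of every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_xml)
    return [item.findtext("title", default="") for item in root.iter("item")]


# Invented feed content for illustration only.
sample_feed = """<rss version="2.0">
  <channel>
    <title>Example announcements</title>
    <item><title>New dataset PXD000001 announced</title></item>
    <item><title>New dataset PXD000002 announced</title></item>
  </channel>
</rss>"""

titles = feed_titles(sample_feed)
```

Note that the channel's own title is not returned, because only `<item>` elements are inspected.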

Page 28: ProteomeXchange Experience: PXD Identifiers and Release of Data on Acceptance, Uploading Large Data Sets

PX submission tool: HPP tags

Page 29: ProteomeXchange Experience: PXD Identifiers and Release of Data on Acceptance, Uploading Large Data Sets

HPP datasets are now tagged

The projects are now tagged and can be browsed as a group of data sets.

Tags for: HPP, C-HPP and B/D-HPP

Page 30: ProteomeXchange Experience: PXD Identifiers and Release of Data on Acceptance, Uploading Large Data Sets

HPP PX datasets: some numbers

Since January 2014, we have been capturing the PI information

- 25 HPP datasets: 22 C-HPP and 3 B/D-HPP

- Countries represented in C-HPP:
  - 5 Spain
  - 4 South Korea
  - 3 Brazil, China

Only a small proportion of the datasets have been made publicly available, at least through ProteomeXchange

Page 31: ProteomeXchange Experience: PXD Identifiers and Release of Data on Acceptance, Uploading Large Data Sets

Which are the most accessed datasets?

PXD Identifier | Hits | Dataset title | Publication

PXD000561 | 153512 | A draft map of the human proteome | Kim et al., Nature, 2014. PMID: 24870542

PXD000851 | 111587 | Membrane proteomic analysis of colorectal cancer tissue | Kume et al., MCP, 2014. PMID: 24687888

PXD000865 | 51639 | Mass spectrometry based draft of the human proteome | Wilhelm et al., Nature, 2014. PMID: 24870543

Page 32: ProteomeXchange Experience: PXD Identifiers and Release of Data on Acceptance, Uploading Large Data Sets

Which are the most accessed datasets?

[Chart: Total Numbers]

Page 33: ProteomeXchange Experience: PXD Identifiers and Release of Data on Acceptance, Uploading Large Data Sets

Conclusions

• ProteomeXchange is widely used.

• PRIDE contains most of the MS/MS datasets.

• It now has a new consortium member: MassIVE (UCSD).

• Around half of the datasets are already public.

• Different open source tools are available to facilitate the process.

• File transfer speed should not be a problem (Aspera support).

• Data deposition enables and promotes data reuse.

• ProteomeXchange is open to new members.

Page 34: ProteomeXchange Experience: PXD Identifiers and Release of Data on Acceptance, Uploading Large Data Sets

Acknowledgements

PRIDE Team: Attila Csordas, Rui Wang, Florian Reisinger, Jose A. Dianes, Tobias Ternent, Yasset Perez-Riverol, Noemi del Toro

Henning Hermjakob

EU FP7 grant number 260558

PeptideAtlas Team (ISB, Seattle): Eric Deutsch, Terry Farrah, Zhi Sun

Andrew R. Jones, Lennart Martens, Juan Pablo Albar, Martin Eisenacher, Gil Omenn, Nuno Bandeira

And many other PX partners and stakeholders

Page 35: ProteomeXchange Experience: PXD Identifiers and Release of Data on Acceptance, Uploading Large Data Sets

Questions?

Page 36: ProteomeXchange Experience: PXD Identifiers and Release of Data on Acceptance, Uploading Large Data Sets

Connecting different data types

It can be used for:

- ArrayExpress / GEO identifiers

- MetaboLights identifiers

- etc.

How to connect different data types (genomics, metabolomics, etc)?

Page 37: ProteomeXchange Experience: PXD Identifiers and Release of Data on Acceptance, Uploading Large Data Sets

Pilot project started in the context of ELIXIR

[Diagram labels: PRIDE (EMBL-EBI); CSC; BILS; Site B; Site C; EUDAT CDI; ELIXIR; B2SAFE (four instances)]