ProteomeXchange Experience: PXD Identifiers and Release of Data on Acceptance, Uploading Large Data Sets
Dr. Juan Antonio Vizcaíno
PRIDE Group Coordinator
Proteomics Services Team
EMBL-EBI
Hinxton, Cambridge, UK
Juan A. Vizcaíno
13th HUPO World Congress, Madrid, 5 October 2014
Overview
• The ProteomeXchange (PX) consortium
• How to submit and access data in PX via PRIDE
• How to access PX data
• Some HPP related things
ProteomeXchange Consortium
• Goal: Development of a framework to allow standard data submission and dissemination pipelines between the main existing proteomics repositories.
• Includes PeptideAtlas (ISB, Seattle), PRIDE (Cambridge, UK) and (very recently) MassIVE (UCSD, San Diego).
• Common identifier space (PXD identifiers)
• Two supported data workflows: MS/MS and SRM.
• Main objective: Make life easier for researchers
http://www.proteomexchange.org
ProteomeXchange data workflow
[Figure: data flow diagram. Researcher's results, raw data* and metadata go to the receiving repositories: PRIDE (MS/MS data), MassIVE (MS/MS data) and PASSEL (SRM data). ProteomeCentral holds metadata / manuscript information. Other elements shown: Journals, UniProt/neXtProt, PeptideAtlas, GPMDB, Other DBs, reprocessed results.]
Vizcaíno et al., Nat Biotechnol, 2014
MassIVE (UCSD)
http://proteomics.ucsd.edu/service/massive/
• Joined ProteomeXchange in June 2014
• Only partial submissions. A few datasets so far.
PASSEL: repository for SRM data
• Suitable for SRM assays
• Part of the PeptideAtlas set of resources.
http://www.peptideatlas.org/passel/
Farrah et al., Proteomics, 2012
Manuscript just out detailing the process
Ternent et al., Proteomics, 2014
http://www.proteomexchange.org/submission
Example dataset: PXD000764
- Title: “Discovery of new CSF biomarkers for meningitis in children”
- 12 runs: 4 controls and 8 infected samples
- Identification and quantification data
PX Data workflow for MS/MS data
1. Mass spectrometer output files: raw data (binary files) or peak list spectra in a standardized format (mzML, mzXML).
2. Result files:
a. Complete submissions: Result files can be converted to PRIDE XML or the mzIdentML data standard.
b. Partial submissions: For workflows not yet supported by PRIDE, search engine output files will be stored and provided in their original form.
3. Metadata: Sufficiently detailed description of sample origin, workflow, instrumentation, submitter.
4. Other files (optional):
a. QUANT: Quantification related results
b. PEAK: Peak list files
c. GEL: Gel images
d. OTHER: Any other file type
e. FASTA
f. SP_LIBRARY
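The complete/partial distinction in the list above can be written as a simple rule over the file types in a submission. This is a minimal sketch, not the PX submission tool's actual logic: the `RAW`, `RESULT` and `SEARCH` labels and the function name `submission_kind` are assumptions made for illustration.

```python
from typing import Iterable


def submission_kind(file_types: Iterable[str]) -> str:
    """Return 'complete', 'partial' or 'invalid' for a set of file-type labels.

    RAW, RESULT and SEARCH are hypothetical labels for raw data, converted
    result files and original search engine output; PEAK is the peak list
    label from the slide.
    """
    types = {t.upper() for t in file_types}
    # Step 1: mass spectrometer output (raw data or peak lists) is required.
    if "RAW" not in types and "PEAK" not in types:
        return "invalid"
    # Step 2a: results converted to PRIDE XML or mzIdentML.
    if "RESULT" in types:
        return "complete"
    # Step 2b: search engine output stored in its original form.
    if "SEARCH" in types:
        return "partial"
    return "invalid"


print(submission_kind(["RAW", "RESULT", "QUANT"]))
print(submission_kind(["RAW", "SEARCH", "FASTA"]))
```

The optional file types (QUANT, GEL, FASTA and so on) do not change the outcome in this sketch; only the presence of converted results decides between complete and partial.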
Complete vs Partial submissions: experimental metadata
[Screenshots: Complete | Partial]
General experimental metadata about the projects is similar. However, at the assay level, the information in partial submissions is less detailed.
Complete vs Partial submissions: processed results
[Screenshots: Complete | Partial]
For complete submissions, the spectra can be connected to the processed identification results, and both can be visualized.
Complete submissions using mzIdentML
[Figure: search engine results + MS files; search engines export mzIdentML]
An increasing number of tools support export to mzIdentML 1.1:
- Mascot
- MSGF+
- Myrimatch and related tools from D. Tabb’s lab
- OpenMS
- PEAKS
- ProCon (ProteomeDiscoverer, Sequest)
- Scaffold
- TPP via the idConvert tool (ProteoWizard)
- ProteinPilot (planned by the end of 2014)
- Others: library for X!Tandem conversion, lab internal pipelines, …
Referenced spectral files need to be submitted as well (all open formats are supported).
Updated list: http://www.psidev.info/tools-implementing-mzIdentML#.
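Because the referenced spectral files must accompany the mzIdentML file, it can help to list them before submitting. The sketch below assumes that spectra files are declared in `SpectraData` elements with a `location` attribute, as in mzIdentML 1.1; the helper names and the sample document (which is minimal and not schema-valid) are made up for illustration.

```python
import xml.etree.ElementTree as ET
from pathlib import PurePosixPath
from typing import List, Set
from urllib.parse import unquote, urlparse


def referenced_spectra(mzid_xml: str) -> List[str]:
    """Return the file names given by SpectraData/@location in the document."""
    root = ET.fromstring(mzid_xml)
    names = []
    for elem in root.iter():
        # Compare local names so the sketch does not depend on the namespace URI.
        if elem.tag.rsplit("}", 1)[-1] == "SpectraData":
            location = elem.get("location", "")
            path = unquote(urlparse(location).path) or location
            names.append(PurePosixPath(path).name)
    return names


def missing_spectra(mzid_xml: str, submitted: Set[str]) -> List[str]:
    """Return referenced spectra files that are not in the submitted set."""
    return [name for name in referenced_spectra(mzid_xml) if name not in submitted]


# Hypothetical, heavily trimmed mzIdentML fragment.
SAMPLE = """<MzIdentML xmlns="http://psidev.info/psi/pi/mzIdentML/1.1">
  <DataCollection><Inputs>
    <SpectraData id="SD_1" location="file:///data/run01.mgf"/>
    <SpectraData id="SD_2" location="file:///data/run02.mgf"/>
  </Inputs></DataCollection>
</MzIdentML>"""

print(missing_spectra(SAMPLE, {"run01.mgf"}))
```

Only POSIX-style paths and `file://` URIs are handled here; other location styles would need extra care.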
‘RESULT’ file generation
[Figure: tools (Mascot, ProteinPilot, Scaffold, PEAKS, MSGF+, others) now offer native file export to mzIdentML; the final ‘RESULT’ file is the mzIdentML file together with the spectra files.]
PRIDE Inspector 2.0
Available for complete submissions
PRIDE Inspector 2.0 supports:
- PRIDE XML
- mzIdentML + all types of spectra files
- mzML
- mzTab Ident (work in progress)
Wang et al., Nat. Biotechnology, 2012
http://code.google.com/p/pride-toolsuite/wiki/PRIDEInspector
PX submission tool: data submission
• Captures the mappings between the different types of files.
• Adds the mandatory metadata annotation.
• Makes the file upload process straightforward for the submitter (it transfers all the files using Aspera or FTP).
• Command line alternative: some scripting is needed.
[Figure: published, raw and other files go into the PX submission tool]
http://www.proteomexchange.org/submission
Uploading large datasets: Aspera
- Aspera is the default file transfer protocol to PRIDE:
  - PX Submission tool
  - Command line
- Up to 50X faster than FTP
File transfer speed should not be a problem!!
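To put the "up to 50X" figure in perspective, here is a back-of-the-envelope estimate for a 1.4 TB dataset (the size quoted for PXD000065 in this talk). The 20 Mbit/s FTP rate is a hypothetical value chosen only for illustration, and the 50X factor is the best case from the slide, not a measured result.

```python
def transfer_hours(size_tb: float, rate_mbit_s: float) -> float:
    """Hours needed to move size_tb terabytes at rate_mbit_s megabits per second."""
    size_bits = size_tb * 1e12 * 8  # decimal terabytes -> bits
    seconds = size_bits / (rate_mbit_s * 1e6)
    return seconds / 3600


FTP_RATE = 20.0              # Mbit/s, hypothetical sustained FTP rate
ASPERA_RATE = FTP_RATE * 50  # best case from the slide: "up to 50X faster"

for label, rate in (("FTP", FTP_RATE), ("Aspera", ASPERA_RATE)):
    print(f"{label}: {transfer_hours(1.4, rate):.1f} h for 1.4 TB")
```

Under these assumptions the upload drops from several days to a few hours, which is the point the slide makes.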
ProteomeXchange: 1329 datasets up until October 2014
Origin:
271 USA
166 Germany
115 United Kingdom
73 Switzerland
70 China
68 Netherlands
67 France
55 Canada
44 Spain
42 Belgium
33 Sweden
31 Australia
31 Denmark
31 Japan
20 India
20 Norway
19 Taiwan
17 Ireland
16 Austria
14 Finland
14 Italy
12 Republic of Korea
11 Brazil
9 Russia
8 Israel
7 Singapore …
Type:
437 PRIDE complete
792 PRIDE partial
63 PeptideAtlas/PASSEL complete
14 MassIVE
23 reprocessed
Publicly Accessible:
691 datasets, 52% of all
86% PRIDE
12% PASSEL
2% MassIVE
Data volume:
Total: ~55 TB
Number of all files: ~131,000
PXD000320-324: ~ 5 TB
PXD000065: ~ 1.4TB
Top Species studied by at least 10 datasets:
577 Homo sapiens
165 Mus musculus
56 Saccharomyces cerevisiae
53 Arabidopsis thaliana
29 Rattus norvegicus
22 Escherichia coli
17 Bos taurus
16 Mycobacterium tuberculosis
13 Oryza sativa
13 Drosophila melanogaster
13 Glycine max
~ 290 species in total
Datasets/year:
2012: 102
2013: 527
2014: 700
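The figures on this slide are internally consistent, which can be checked with a few lines of arithmetic. The numbers below are copied from the slide; the variable names are arbitrary.

```python
# Dataset counts copied from the slide.
by_type = {
    "PRIDE complete": 437,
    "PRIDE partial": 792,
    "PeptideAtlas/PASSEL complete": 63,
    "MassIVE": 14,
    "reprocessed": 23,
}
by_year = {2012: 102, 2013: 527, 2014: 700}
total = 1329
public = 691

# Both breakdowns add up to the stated total.
assert sum(by_type.values()) == total
assert sum(by_year.values()) == total

public_share = f"{public / total:.0%}"
partial_share = f"{by_type['PRIDE partial'] / total:.0%}"
print("publicly accessible:", public_share)
print("PRIDE partial submissions:", partial_share)
```

The public share reproduces the 52% quoted on the slide, and the second figure shows that partial submissions make up well over half of all datasets.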
Public data release: when does it happen?
• When the author tells us to do it (the authors can do it by themselves)
• When we find out that a dataset has been published
  • We look for PXD identifiers in PubMed abstracts.
  • If your PXD identifier is not in the abstract, a paper may have been published while the data is still private. Let us know!
• New web form on the PRIDE website to facilitate the process
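Looking for PXD identifiers in abstract text amounts to a pattern search. The following is a minimal sketch, not the pipeline PRIDE actually runs: it assumes an identifier is "PXD" followed by six digits, as in the identifiers shown in this talk, and the sample abstract is invented.

```python
import re
from typing import List

# Assumed identifier shape: "PXD" followed by exactly six digits.
PXD_PATTERN = re.compile(r"\bPXD\d{6}\b")


def find_pxd_identifiers(text: str) -> List[str]:
    """Return the distinct PXD identifiers mentioned in text, sorted."""
    return sorted(set(PXD_PATTERN.findall(text)))


# Invented abstract text for illustration.
abstract = (
    "The mass spectrometry data have been deposited to the "
    "ProteomeXchange Consortium with the dataset identifier PXD000764."
)
print(find_pxd_identifiers(abstract))
```

An abstract that does not mention the identifier returns an empty list, which is exactly the case where the submitter needs to notify the repository.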
Partial submissions can be used to store other data types
• Everything can be stored, not only MS/MS data: very flexible mechanism to be able to capture all types of datasets
• PRIDE does not store SRM data (it goes to PASSEL)
• Top down proteomics datasets.
• Mass Spectrometry Imaging datasets.
• Data independent acquisition techniques: e.g. SWATH-MS datasets.
Imaging MS datasets: partial submissions
From original publication [13]:
1. Thermo RAW data / UDP
2. Mirion Software (JLU)
Reconstructed ProteomeXchange data:
1. Thermo RAW data / UDP
2. Convert to imzML
3. Upload to PRIDE repository (EBI, Cambridge, UK)
4. Download from PRIDE
5. Display in MSiReader
- Vendor-independent data format
- Freely available software (open source)
- 'Open data': free to reuse
- Anybody can do this!
No file size limit!
ProteomeCentral: Portal for all PX datasets
http://proteomecentral.proteomexchange.org/cgi/GetDataset
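A link to a single dataset can be built from the portal address above. This is a sketch under one assumption: that the portal accepts the dataset accession in an `ID` query parameter. The helper name `dataset_url` is made up for this example.

```python
from urllib.parse import urlencode

BASE_URL = "http://proteomecentral.proteomexchange.org/cgi/GetDataset"


def dataset_url(pxd_id: str) -> str:
    """Build a ProteomeCentral link for one dataset (ID parameter is assumed)."""
    return f"{BASE_URL}?{urlencode({'ID': pxd_id})}"


print(dataset_url("PXD000561"))
```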
Get notified about new PX datasets
- Subscribe to the RSS Feed to receive information about the new datasets:
http://groups.google.com/group/proteomexchange/feed/rss_v2_0_msgs.xml
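Reading such a feed programmatically is straightforward. The sketch below assumes a plain RSS 2.0 document with `item` elements that carry a `title`; the sample feed is invented and is not real ProteomeXchange output. Downloading the feed itself is left out.

```python
import xml.etree.ElementTree as ET
from typing import List


def item_titles(rss_xml: str) -> List[str]:
    """Return the title of every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_xml)
    return [item.findtext("title", default="") for item in root.iter("item")]


# Invented sample feed for illustration only.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example announcements</title>
    <item><title>Example dataset announcement A</title></item>
    <item><title>Example dataset announcement B</title></item>
  </channel>
</rss>"""

for title in item_titles(SAMPLE_FEED):
    print(title)
```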
PX submission tool: HPP tags
HPP datasets are now tagged
The projects are now tagged and can be browsed as a group of datasets.
Tags for: HPP, C-HPP and B/D-HPP
HPP PX datasets: some numbers
Since January 2014, we have been capturing the PI information.
- 25 HPP datasets: 22 C-HPP and 3 B/D-HPP
- Countries represented in C-HPP:
  - 5 Spain
  - 4 South Korea
  - 3 Brazil, China
Only a small proportion of the datasets have been made publicly available, at least through ProteomeXchange.
Which are the most accessed datasets?
PXD Identifier | Hits | Dataset title | Publication
PXD000561 | 153512 | A draft map of the human proteome | Kim et al., Nature, 2014. PMID: 24870542
PXD000851 | 111587 | Membrane proteomic analysis of colorectal cancer tissue | Kume et al., MCP, 2014. PMID: 24687888
PXD000865 | 51639 | Mass spectrometry based draft of the human proteome | Wilhelm et al., Nature, 2014. PMID: 24870543
Which are the most accessed datasets?
[Figure: total numbers]
Conclusions
• ProteomeXchange is widely used.
• PRIDE contains most of the MS/MS datasets.
• It now has a new consortium member: MassIVE (UCSD).
• Around half of the datasets are already public.
• Different open source tools are available to facilitate the process.
• File transfer speed should not be a problem (Aspera support).
• Data deposition enables and promotes data reuse.
• ProteomeXchange is open to new members.
Acknowledgements
PRIDE Team: Attila Csordas, Rui Wang, Florian Reisinger, Jose A. Dianes, Tobias Ternent, Yasset Perez-Riverol, Noemi del Toro
Henning Hermjakob
EU FP7 grant number 260558
PeptideAtlas Team (ISB, Seattle): Eric Deutsch, Terry Farrah, Zhi Sun
Andrew R. Jones, Lennart Martens, Juan Pablo Albar, Martin Eisenacher, Gil Omenn, Nuno Bandeira
And many other PX partners and stakeholders
Connecting different data types
How to connect different data types (genomics, metabolomics, etc.)?
It can be used for:
- ArrayExpress/GEO identifiers
- MetaboLights identifiers
- etc.
Pilot project started in the context of ELIXIR
[Figure labels: PRIDE (EMBL-EBI), ELIXIR, EUDAT CDI, B2SAFE, CSC, BILS, Site B, Site C]