Do we need to make public our proteomics data?
Dr. Yasset Perez-Riverol
Twitter: @ypriverol
Github: ypriverol
Bioinformatician - PRIDE Group
Proteomics Services Team
EMBL-EBI
Hinxton, Cambridge, UK
Yasset Perez-Riverol [email protected]
BRPROT 2014, Búzios, Brazil (Dec 7-10, 2014)
I believe in open data, source, access …
An Integrated, Directed Mass Spectrometric Approach for In-depth
Characterization of Complex Peptide Mixtures. Mol Cell Proteomics.
Nov 2008; 7(11): 2138–2150
1 dataset (no cost) => 4 papers and 3 new
algorithms
I believe in open data, source, access policies…
Overview
• Proteomics data deposition, bad practices and
experiences.
• PRIDE and ProteomeXchange
• PRIDE Components.
• Ongoing and Future work!
Proteomics data deposition, bad practices,
experiences.
Protein Expression Databases
Processed Data
RAW Data
ProteomeXchange Consortium
• Goal: Development of a framework to allow
standard data submission and dissemination
pipelines between the main existing proteomics
repositories.
• Includes PeptideAtlas (ISB, Seattle), PRIDE
(Cambridge, UK) and MassIVE (UCSD, San Diego).
• Common identifier space (PXD identifiers)
• Two supported data workflows: MS/MS and SRM.
http://www.proteomexchange.org
ProteomeXchange data workflow
[Figure: ProteomeXchange data workflow. Labels: researcher's results, raw data*, metadata/manuscript; receiving repositories PRIDE (MS/MS data), MassIVE (MS/MS data) and PASSEL (SRM data); ProteomeCentral; journals, UniProt/neXtProt, PeptideAtlas, GPMDB, other DBs; reprocessed results.]
Vizcaíno et al., Nat Biotechnol, 2014
ProteomeXchange Partners: MassIVE (UCSD)
http://proteomics.ucsd.edu/service/massive/
• Joined ProteomeXchange in June 2014
ProteomeXchange Partners: PASSEL for SRM data
• Suitable for SRM assays
• Part of the PeptideAtlas set of resources.
http://www.peptideatlas.org/passel/
Farrah et al., Proteomics, 2012
ProteomeXchange Partners: PRIDE
Vizcaíno et al., Nucleic Acids Research, 2014
http://www.ebi.ac.uk/pride/archive/
Current status of databases & repositories.
Perez-Riverol Y, et al. Proteomics. 2014
Processed Data & RAW Data:
• PRIDE
• PASSEL
• Chorus
• MassIVE
Protein Expression Databases:
• PeptideAtlas
• GPMDB
• proteomicsDB
• PaxDb
• Human Proteinpedia
• MaxQB
• PRIDE
• PASSEL
• Human Proteome Map
• MOPED
Protein resources:
• UniProt
• neXtProt
PX Submission workflow for MS/MS data
1. Mass spectrometer output files: raw data (binary files) or
peak list spectra in a standardized format (mzML, mzXML).
2. Result files:
a. Partial submissions: For workflows not yet supported by
PRIDE, search engine output files will be stored and
provided in their original form.
b. Complete submissions: Result files can be converted
to PRIDE XML or the mzIdentML data standard.
3. Metadata: Sufficiently detailed description of sample origin,
workflow, instrumentation, submitter based on Ontologies and
Controlled Vocabularies.
4. Other files (optional):
a. QUANT: Quantification-related results
b. PEAK: Peak list files
c. OTHER: Any other file type
e. FASTA
[Figure labels: Published, Raw files, Other files]
Ternent et al., Proteomics, 2014
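The difference between complete and partial submissions described above can be sketched as a check on the result-file formats. This is an illustration only: the function name, the file-extension list and the example file names are assumptions made for this sketch, not part of the PX submission tool, which performs its own validation.

```python
from pathlib import Path

# Assumed extensions for this sketch only; real submissions are validated by
# the PX submission tool, not by looking at file extensions.
STANDARD_RESULT_SUFFIXES = (".mzid", ".mzid.gz", ".pride.xml", ".pride.xml.gz")


def classify_submission(result_files):
    """Return "complete" if every result file is mzIdentML or PRIDE XML, else "partial"."""
    if not result_files:
        raise ValueError("a submission needs at least one result file")
    names = [Path(f).name.lower() for f in result_files]
    if all(name.endswith(STANDARD_RESULT_SUFFIXES) for name in names):
        return "complete"
    return "partial"


print(classify_submission(["sample1.mzid", "sample2.mzid.gz"]))  # complete
print(classify_submission(["sample1.dat"]))                      # partial
```

A single result file in a non-standard format is enough to make the whole submission partial in this sketch, which mirrors the rule that such files are stored in their original form.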
Complete submissions (mzIdentML)
[Figure: search engines export to mzIdentML; search engine results are submitted together with the MS files]
An increasing number of tools support export to mzIdentML 1.1:
- Mascot
- MSGF+
- MyriMatch and related tools from D. Tabb’s lab
- OpenMS
- PEAKS
- ProCon (ProteomeDiscoverer, Sequest)
- Scaffold
- TPP via the idConvert tool (ProteoWizard)
- ProteinPilot (planned by the end of 2014)
- Others: library for X!Tandem conversion, lab internal pipelines, …
Referenced spectral files need to be submitted as well (all open formats are supported).
Updated list: http://www.psidev.info/tools-implementing-mzIdentML#.
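Because referenced spectral files must accompany an mzIdentML file, it can help to list them before uploading. The sketch below reads the SpectraData elements of an mzIdentML 1.1 document with the Python standard library. The XML fragment, file names and function names are made up for illustration; this is not how the PX submission tool itself performs the check.

```python
import xml.etree.ElementTree as ET

# Tiny hand-written mzIdentML 1.1 fragment, used only for illustration.
EXAMPLE_MZID = """<MzIdentML xmlns="http://psidev.info/psi/pi/mzIdentML/1.1" id="example" version="1.1.0">
  <DataCollection>
    <Inputs>
      <SpectraData id="SD_1" location="file:///data/run01.mgf"/>
      <SpectraData id="SD_2" location="run02.mzML"/>
    </Inputs>
  </DataCollection>
</MzIdentML>
"""


def referenced_spectra_files(mzid_text):
    """Return the file names referenced by SpectraData elements."""
    root = ET.fromstring(mzid_text)
    names = []
    for element in root.iter():
        # Compare local names so the sketch does not depend on the namespace URI.
        if element.tag.rsplit("}", 1)[-1] == "SpectraData":
            location = element.get("location", "")
            names.append(location.replace("\\", "/").rsplit("/", 1)[-1])
    return names


def missing_spectra_files(mzid_text, submitted_files):
    """List referenced spectra files that are absent from the submitted file names."""
    submitted = set(submitted_files)
    return [n for n in referenced_spectra_files(mzid_text) if n not in submitted]


print(missing_spectra_files(EXAMPLE_MZID, ["run01.mgf"]))  # ['run02.mzML']
```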
Universal file format (mzTab)
http://mztab.googlecode.com
Metadata
• Basic information about experiment and sample
• Key-value pairs
Protein
• Basic information about protein identifications
• Table-based
Peptide
• Information about quantified peptides
• Table-based
PSM
• Information about identified spectra
• Table-based
Small Molecule
• Basic information about identified small molecules
• Table-based
J. Griss et al., MCP, 2014
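The section layout above maps onto line prefixes in an mzTab file: MTD for metadata key-value pairs, and a header/row prefix pair for each table (PRH/PRT, PEH/PEP, PSH/PSM, SMH/SML). A minimal reader can be sketched as follows. The example lines are hand-written and heavily shortened, and the function name is an assumption for this sketch; a real mzTab file has many more mandatory columns.

```python
from collections import defaultdict

# Hand-written mzTab-like lines for illustration; real files have more columns.
EXAMPLE_MZTAB = "\n".join([
    "MTD\tmzTab-version\t1.0.0",
    "MTD\tdescription\tExample experiment",
    "PRH\taccession\tdescription",
    "PRT\tP12345\tExample protein",
    "PSH\tsequence\tPSM_ID",
    "PSM\tNMMAACDPR\t1",
    "PSM\tPPECPDFDPPR\t2",
])

# Header prefix -> row prefix for the table-based sections.
TABLE_SECTIONS = {"PRH": "PRT", "PEH": "PEP", "PSH": "PSM", "SMH": "SML"}


def read_mztab(text):
    """Split mzTab text into a metadata dict and per-section lists of row dicts."""
    metadata = {}
    headers = {}
    tables = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        prefix = fields[0]
        if prefix == "MTD" and len(fields) >= 3:
            metadata[fields[1]] = fields[2]
        elif prefix in TABLE_SECTIONS:
            # Remember the column names for the rows that follow.
            headers[TABLE_SECTIONS[prefix]] = fields[1:]
        elif prefix in headers:
            tables[prefix].append(dict(zip(headers[prefix], fields[1:])))
    return metadata, dict(tables)


metadata, tables = read_mztab(EXAMPLE_MZTAB)
print(metadata["description"])                  # Example experiment
print([row["sequence"] for row in tables["PSM"]])  # ['NMMAACDPR', 'PPECPDFDPPR']
```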
PRIDE Components: Submission Process
PRIDE Converter PRIDE Inspector PX Submission Tool
PRIDE Components: PX submission tool
• Captures the mappings between the different types of files.
• Adds the mandatory metadata annotation.
• Makes the file upload process straightforward for the submitter (it transfers all the files using Aspera or FTP).
• Command line alternative: some scripting is needed.
http://www.proteomexchange.org/submission
[Figure labels: Published, Raw, Other files, PX submission tool]
Available for complete submissions
Wang et al., Nat. Biotechnology, 2012
PRIDE Inspector 2.0
PRIDE Inspector 2.0 supports:
- PRIDE XML
- mzIdentML + all types of spectra files
- mzML
- mzTab Quantitation (work in progress)
https://github.com/PRIDE-Toolsuite/
PRIDE Components: Services & Web components
ProteomeXchange: 1329 datasets up until October 2014
Origin:
293 USA
184 Germany
143 UK
83 France
82 Netherlands
78 China
62 Switzerland
46 Spain
45 Belgium
45 Canada
42 Denmark
37 Australia
37 Japan
34 Sweden
26 Austria
21 Brazil
21 Taiwan
21 India
20 Norway
19 Finland
17 Ireland
14 Italy
12 Republic of Korea
9 Singapore
8 Israel
8 Russia
Type:
437 PRIDE complete
792 PRIDE partial
63 PeptideAtlas/PASSEL complete
14 MassIVE
23 reprocessed
Publicly Accessible:
691 datasets, 52% of all
86% PRIDE
12% PASSEL
2% MassIVE
Data volume:
Total: ~55 TB
Number of all files: ~131,000
PXD000320-324: ~ 5 TB
PXD000065: ~ 1.4TB
Top Species studied by at least 10
datasets:
577 Homo sapiens
165 Mus musculus
56 Saccharomyces cerevisiae
53 Arabidopsis thaliana
29 Rattus norvegicus
22 Escherichia coli
17 Bos taurus
16 Mycobacterium tuberculosis
13 Oryza sativa
13 Drosophila melanogaster
13 Glycine max
~ 290 species in total
Datasets/year:
2012: 102
2013: 527
2014: 700
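The figures on this slide are internally consistent, which a few lines of arithmetic confirm: the per-type and per-year counts each add up to the 1329 datasets, and 691 public datasets is about 52% of the total. The variable names below are chosen for this sketch only.

```python
total_datasets = 1329

by_type = {
    "PRIDE complete": 437,
    "PRIDE partial": 792,
    "PeptideAtlas/PASSEL complete": 63,
    "MassIVE": 14,
    "reprocessed": 23,
}
by_year = {2012: 102, 2013: 527, 2014: 700}
public_datasets = 691

# Both breakdowns should account for every dataset.
assert sum(by_type.values()) == total_datasets
assert sum(by_year.values()) == total_datasets

public_share = round(100 * public_datasets / total_datasets)
print(f"Public datasets: {public_share}% of {total_datasets}")  # Public datasets: 52% of 1329
```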
Brazil Submissions:
21 Projects
11 PXD Public
10 PXD Private
Main Contributors:
Martins-de-Souza D. PhD (6)
Domont G. Prof (4)
Carvalho PC. (4)
Journals and Data Deposition
[Figure: number of submissions per journal]
Data Access?
[Figure: chart with axis label "Total Numbers"]
PXD Identifier   Hits     Dataset title
PXD000561        153512   A draft map of the human proteome
PXD000865        51639    Mass spectrometry based draft of the human proteome
Ongoing and future work.
• Quality assessment of complete submissions.
• Make the data available and reusable.
• Integration of different Protein expression resources
• PRIDE
• PeptideAtlas
• ProteomicsDB
• Human Proteome Map
QC with PRIDE Inspector
QC PRIDE Inspector and Quantitation
Validation of controversial data
• Analysis of Tyrannosaurus rex fossils: controversial presence of
collagen (is it a contamination of the sample?)
Asara et al. (2007) Science 316: 280-5.
Asara et al. (2007) Science 316: 1324-5.
Bern et al. (2009) JPR 9: 4328-32
PRIDE assay accession 8633
Quality control: PRIDE Cluster
• Data integration across many experiments before filtering
• Assumption: The same peptide will generate the same MS/MS
spectrum in many experiments
• Cluster all spectra in PRIDE
• Those clusters which contain only/mainly one peptide are considered
reliable
[Figure: example spectrum cluster with member spectra identified as NMMAACDPR and PPECPDFDPPR, and the cluster's consensus spectrum]
Griss et al., Nature Met., 2012
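The reliability rule on this slide (a cluster is trusted when it contains only or mainly one peptide) can be sketched as a purity calculation. The code covers only that last step and assumes the spectra have already been clustered; the function names and the 0.7 threshold are arbitrary choices for this illustration, not the values used by PRIDE Cluster.

```python
from collections import Counter


def cluster_purity(peptide_ids):
    """Return (most frequent peptide, fraction of spectra assigned to it)."""
    if not peptide_ids:
        raise ValueError("cluster has no identified spectra")
    peptide, count = Counter(peptide_ids).most_common(1)[0]
    return peptide, count / len(peptide_ids)


def is_reliable(peptide_ids, min_purity=0.7):
    """Treat a cluster as reliable when one peptide dominates its identifications."""
    # min_purity is an arbitrary threshold chosen for this sketch.
    _, purity = cluster_purity(peptide_ids)
    return purity >= min_purity


# Peptide identifications of the spectra in one hypothetical cluster.
cluster = ["NMMAACDPR", "NMMAACDPR", "NMMAACDPR", "PPECPDFDPPR"]
print(cluster_purity(cluster))  # ('NMMAACDPR', 0.75)
print(is_reliable(cluster))     # True
```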
Spectral libraries
Sneak peek of the new PRIDE Cluster web
Make data available and reusable.
• Around 60% of the data used for the
analysis comes from previous
experiments, most of them stored in
proteomics repositories such as
PRIDE/ProteomeXchange, PASSEL or
MassIVE.
Perez-Riverol Y, et al.
Proteomics. 2014
CompOmics Open Source Analysis Pipeline
http://searchgui.googlecode.com
Vaudel M, Barsnes H, Berven FS, Sickmann A, Martens L:
Proteomics 2011;11(5):996-9.
http://peptide-shaker.googlecode.com
Vaudel M, Burkhart J, Zahedi RP, Berven FS, Sickmann A, Martens L, Barsnes H:
Nature Biotechnology (in press)
Reshake PRIDE data!
Find the desired PRIDE project …
… inspect the project details …
… and start re-analyzing the data!
Current status of databases & repositories.
Perez-Riverol Y, et al. Proteomics. 2014
Processed Data & RAW Data:
• PRIDE
• PASSEL
• Chorus
• MassIVE
Protein Expression Databases:
• PeptideAtlas
• GPMDB
• proteomicsDB
• PaxDb
• Human Proteinpedia
• MaxQB
• PRIDE
• PASSEL
• Human Proteome Map
• MOPED
Protein resources:
• UniProt
• neXtProt
Integration of different Protein expression resources
[Figure labels: PROXI clients, web services, registry, PROXI interfaces, repositories & databases, data]
Perez-Riverol Y, Proteomics, 2014
Conclusions
• ProteomeXchange is widely used.
• PRIDE contains most of the MS/MS datasets.
• ProteomeXchange now has a new consortium member: MassIVE (UCSD).
• Around half of the datasets are already public.
• Different open source tools available to facilitate the process:
• File transfer speed should not be a problem (Aspera support)
• Data deposition enables and promotes data reuse.
• ProteomeXchange is open to new members.
Acknowledgements
PRIDE Team
Juan A. Vizcaino (Group Leader)
Attila Csordas
Rui Wang
Florian Reisinger
Jose A. Dianes
Tobias Ternent
Noemi del Toro
Henning Hermjakob
PeptideAtlas Team (ISB, Seattle)
Eric Deutsch
Terry Farrah
Zhi Sun
MassIVE
Nuno Bandeira
And many other PX partners and
stakeholders