INTRODUCTION
We connect, in a complete pipeline, an ontology-based environment for proteomics spectra management with a distributed complete validation platform for predictive analysis. We build on two existing software platforms (MS-Analyzer and BioDCV) and on emerging proteomics standards. In this set-up, BioDCV is accessed from the MS-Analyzer workflow as a service, yielding an end-to-end pipeline for proteomics data analysis. Predictive classification studies on MALDI-TOF data based on this pipeline are presented.
REFERENCES
[1] M. Cannataro, P. Guzzi, T. Mazza, G. Tradigo, P. Veltri. Using ontologies for preprocessing and mining spectra data on the Grid. FGCS, 2006, in press. http://dx.doi.org/10.1016/j.future.2006.04.011
[2] M. Cannataro, P.H. Guzzi, T. Mazza, G. Tradigo, P. Veltri. Preprocessing of Mass Spectrometry Proteomics Data on the Grid. IEEE CBMS 2005: 549-554.
[3] C. Furlanello, M. Serafini, S. Merler, G. Jurman. Semi-supervised learning for molecular profiling. IEEE Transactions on Computational Biology and Bioinformatics, 2(2):110-118, 2005.
[4] A. Barla, B. Irler, S. Merler, G. Jurman, S. Paoli, C. Furlanello. Proteome profiling without selection bias. IEEE CBMS 2006: 941-946.
[5] R. Tibshirani, T. Hastie, B. Narasimhan, S. Soltys, G. Shi, A. Kong, Q. Le. Sample classification from protein mass spectrometry, by "peak probability contrasts". Bioinformatics, 20(17):3034-3044, 2004.
Workflows, ontologies and standards for unbiased prediction in high-throughput proteomics
Cannataro M*, Barla A**, Gallo A*, Paoli S**, Jurman G**, Merler S**, Veltri P*, Furlanello C**. *University Magna Graecia of Catanzaro, Italy, **ITC-irst, Trento, Italy
MGED 9, September 7-10, 2006, Seattle, WA, U.S.A.
DATASETS
D1. MALDI-TOF Ovarian Cancer Dataset, from www-stat.stanford.edu/~tibs/PPC/Rdist [5]
• 49 samples (24 diseased + 25 controls)
• Each raw sample has 56384 m/z measurements (892 KB)
• Each preprocessed sample has 564 m/z measurements (19 KB)
• Preprocessing:
  • Normalization
  • Binning
• Biomarker identification:
  • Baseline subtraction
  • Peak Alignment – Clustering
  • 67 features identified
D2. Lab calibration sample: MALDI-TOF, human serum, 20 technical replicates (10 control samples, 10 with 2 proteins), 34671 measurements, 347 m/z after preprocessing, predictive discrimination with 7 peaks.
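The normalization and binning steps listed for the datasets can be sketched as follows. This is only an illustration: the poster does not state the bin width or the normalization rule. The reported sizes (56384 to 564 for D1, 34671 to 347 for D2) are consistent with averaging 100 consecutive readings per bin, so that width is assumed here, and total-intensity scaling is an assumed normalization. The function names `normalize_total` and `bin_spectrum` are hypothetical.

```python
def normalize_total(intensities):
    """Scale a spectrum so that its intensities sum to 1 (assumed rule)."""
    total = sum(intensities)
    if total == 0:
        return list(intensities)
    return [value / total for value in intensities]


def bin_spectrum(intensities, bin_width=100):
    """Average consecutive readings in fixed-width bins (assumed width)."""
    binned = []
    for start in range(0, len(intensities), bin_width):
        chunk = intensities[start:start + bin_width]
        binned.append(sum(chunk) / len(chunk))
    return binned


# Synthetic flat spectra with the raw lengths reported for D1 and D2.
raw_d1 = [1.0] * 56384
raw_d2 = [1.0] * 34671
prepared_d1 = bin_spectrum(normalize_total(raw_d1))
prepared_d2 = bin_spectrum(normalize_total(raw_d2))
print(len(prepared_d1), len(prepared_d2))
```

The last bin is shorter than the others when the raw length is not a multiple of the bin width; averaging rather than summing keeps it on the same scale.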
MS-ANALYZER
MS-Analyzer [1] is a platform for the integrated management and processing of proteomics spectra data. It supports the ontology-based design of “in silico” proteomics studies: ontologies are used to model software tools and spectra data, while workflows are used to model applications. MS-Analyzer uses a specialized spectra database and provides a set of pre-processing services:
• Interface to heterogeneous mass spectrometer formats such as MALDI-TOF, SELDI-TOF, ICAT-based LC-MS/MS. Formats are unified into mzData, in compliance with the HUPO-PSI proteomics standardization initiative.
• Acquisition, storage, and management of MS data with the SpecDB database. Spectra are stored in their different stages (raw, pre-processed, prepared). Single spectra, multiple spectra, or portions of spectra can be queried (in-database preprocessing).
• Preprocessing of MS data (smoothing, baseline subtraction, normalization, binning, peak alignment), as well as spectra preparation for further data mining (spectra to ARFF conversion) [2].
• Sharing of experiment data, workflows, and knowledge.
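The spectra-to-ARFF preparation step mentioned above can be illustrated with a short sketch. It is not MS-Analyzer's converter: the function name `spectra_to_arff`, the attribute naming scheme (`mz_<value>`), and the sample values are assumptions made for the example. Only the ARFF layout itself (`@RELATION`, `@ATTRIBUTE`, `@DATA`) is standard.

```python
def spectra_to_arff(relation, mz_values, spectra, labels, classes):
    """Render pre-processed spectra as ARFF text, one data row per spectrum."""
    lines = ["@RELATION " + relation, ""]
    # One numeric attribute per m/z position shared by all spectra.
    for mz in mz_values:
        lines.append("@ATTRIBUTE mz_%s NUMERIC" % format(mz, "g"))
    # Nominal class attribute, e.g. {control,diseased}.
    lines.append("@ATTRIBUTE class {%s}" % ",".join(classes))
    lines.append("")
    lines.append("@DATA")
    for intensities, label in zip(spectra, labels):
        if len(intensities) != len(mz_values):
            raise ValueError("spectrum length does not match the m/z axis")
        if label not in classes:
            raise ValueError("unknown class label: %r" % label)
        row = [format(value, "g") for value in intensities] + [label]
        lines.append(",".join(row))
    return "\n".join(lines) + "\n"


# Two tiny hypothetical spectra on a shared two-point m/z axis.
arff_text = spectra_to_arff(
    relation="ms_spectra",
    mz_values=[1000.0, 1001.5],
    spectra=[[0.5, 0.25], [0.1, 0.9]],
    labels=["control", "diseased"],
    classes=["control", "diseased"],
)
print(arff_text)
```

The sketch requires all spectra to share one m/z axis, which is what binning and peak alignment are meant to guarantee before preparation.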
[Figure: MS-Analyzer architecture. Ontology-based Workflow Designer: Ontology Assistant (browsing, querying), WF Editor (composition, browsing, selection, visualization), WF Schema (abstract and concrete WF). Workflow Scheduler: Resource Discovery Services, WF Translator, WF Scheduler, WF Monitor. Ontology manager and Ontologies. UDDI/MDS with WSDL metadata. Web services (WS1, WS2) reached over the network for Spectra Management, Spectra Visualization, Spectra Preparation, and Spectra Preprocessing. SpecDB APIs over the raw, pre-processed, and prepared spectra stores (labelled RSR, PPSR, PSR in the figure).]
[Figure: web services architecture. Components: M-WS (Ontology-based Workflow Designer); FTP repository (data, metadata); BioDCV WS on the DMZ server (Apache, mod_python, ZSI module), which receives the repository URL and email; BioDCV WS front-end server.]
BIODCV
The predictive modeling portion of the proposed system is provided by BioDCV, the ITC-irst platform for machine learning in high-throughput functional genomics. BioDCV fully supports complete validation in order to control selection bias effects. To handle the intensive data throughput, BioDCV uses E-RFE, an entropy-based acceleration of the SVM-RFE feature ranking procedure [3].
For proteomics, it includes methods for baseline subtraction, spectra alignment, peak clustering and peak assignment that were adapted from existing R packages and chained into the complete validation system.
BioDCV is also a grid application and it has been used in production within the EGEE Biomed VO [4].
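The idea behind complete validation, namely that feature ranking is repeated inside every resampling split so that test samples never influence feature selection, can be sketched as follows. This is not BioDCV's implementation: a simple mean-difference filter stands in for E-RFE, a nearest-centroid rule stands in for the SVM, and all function names and parameters are hypothetical.

```python
import random
from statistics import mean, pstdev


def rank_features(X, y):
    """Rank features by a two-class separation score, best first."""
    scores = []
    for j in range(len(X[0])):
        a = [row[j] for row, lab in zip(X, y) if lab == 0]
        b = [row[j] for row, lab in zip(X, y) if lab == 1]
        spread = pstdev(a) + pstdev(b) + 1e-12  # avoid division by zero
        scores.append(abs(mean(a) - mean(b)) / spread)
    return sorted(range(len(scores)), key=lambda j: -scores[j])


def centroid_classifier(X, y, features):
    """Return a predictor assigning a row to the nearest class centroid."""
    centroids = {}
    for lab in (0, 1):
        rows = [row for row, l in zip(X, y) if l == lab]
        centroids[lab] = [mean(r[j] for r in rows) for j in features]

    def predict(row):
        dist = {lab: sum((row[j] - c) ** 2 for j, c in zip(features, cen))
                for lab, cen in centroids.items()}
        return min(dist, key=dist.get)

    return predict


def complete_validation(X, y, n_features, n_splits=50,
                        test_fraction=0.25, seed=0):
    """Average test error with ranking redone on each training split."""
    rng = random.Random(seed)
    idx = list(range(len(X)))
    errors = []
    for _ in range(n_splits):
        rng.shuffle(idx)
        n_test = max(1, int(len(idx) * test_fraction))
        test, train = idx[:n_test], idx[n_test:]
        X_tr, y_tr = [X[i] for i in train], [y[i] for i in train]
        if len(set(y_tr)) < 2:
            continue  # skip splits whose training part lacks a class
        # Ranking sees training samples only: this is the key step.
        top = rank_features(X_tr, y_tr)[:n_features]
        predict = centroid_classifier(X_tr, y_tr, top)
        wrong = sum(predict(X[i]) != y[i] for i in test)
        errors.append(wrong / n_test)
    return mean(errors)
```

Ranking features once on the full dataset and only then resampling would let test samples shape the feature list, which is the selection bias that complete validation is designed to avoid.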
[Figure: BioDCV processing chain. Feature extraction (within sample, across sample), followed by complete validation. R scripts produce the visualizations (ATE, sample tracking); PHP produces the biomarker lists and the HTML publication. Outputs: biomarkers data and report.]
ACKNOWLEDGMENTS
• ITC-irst: R. Flor, D. Albanese, B. Irler
• UniCZ: G. Cuda, M. Gaspari, P.H. Guzzi, T. Mazza
WEB SERVICES ARCHITECTURE
Three Internet web services remotely integrate the two main system components.
The BioDCV component is invoked from the MS-Analyzer workflow as a web service (biodcv-ws-client) in the UniCZ network: data and metadata are copied to an FTP repository, then the data URL and a notification email address are transmitted to the BioDCV web service (biodcv-ws) in a DMZ area of the ITC-irst network.
This service is run directly by Apache with mod_python and the Zolera SOAP Infrastructure (ZSI). The incoming data are transferred to the internal front-end server (server-cz-tn.py) within the firewalled area.
The front-end first launches the feature extraction module and then a full complete validation process using the BioDCV component. The system outputs are then formatted as graphs and tables by R and PHP scripts. The results are published by the front-end on the DMZ server, and the user is notified by email.
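The request flow described in this section (data URL plus notification email in, feature extraction, complete validation, publication, email out) can be summarized in a small sketch. It does not reproduce the actual biodcv-ws or server-cz-tn.py code or the ZSI API: the function names, the request fields, and the example host names are all hypothetical, and the processing stages are passed in as plain callables.

```python
from urllib.parse import urlparse


def build_job_request(data_url, notify_email):
    """Assemble the two items the client transmits: data URL and email."""
    parsed = urlparse(data_url)
    if parsed.scheme != "ftp" or not parsed.netloc:
        raise ValueError("data_url must be an ftp:// URL")
    if "@" not in notify_email:
        raise ValueError("notify_email does not look like an address")
    return {"data_url": data_url, "notify_email": notify_email}


def run_front_end(request, fetch, extract_features, validate, publish, notify):
    """Run the front-end stages in the order given in the text."""
    data = fetch(request["data_url"])        # copy data from the repository
    features = extract_features(data)        # feature extraction module
    results = validate(features)             # complete validation
    report_url = publish(results)            # results published on the DMZ server
    notify(request["notify_email"], report_url)  # notification by email
    return report_url


# Stub stages that only record the order in which they are called.
calls = []
request = build_job_request("ftp://repository.example.org/d1.zip",
                            "user@example.org")
report = run_front_end(
    request,
    fetch=lambda url: calls.append("fetch") or "spectra",
    extract_features=lambda data: calls.append("extract") or "peaks",
    validate=lambda feats: calls.append("validate") or "results",
    publish=lambda res: calls.append("publish") or "http://dmz.example.org/report",
    notify=lambda email, url: calls.append("notify"),
)
print(calls, report)
```

Passing the stages as callables keeps the sketch independent of any transport or SOAP library, so only the ordering of the steps is asserted here.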
[Figure: ATE (y-axis, ticks 10 to 40) versus number of features (1, 5, 10, 15, 20, 30, 40, 50, 67). Legend: error rate (tumour tissue), error rate (non-tumoural tissue), no-information error rate.]
[Figure: sample tracking, 25 panels plotting E(S) (0.0 to 1.0) against n (ticks 1, 5, 50). Panel titles: 1: S0 (26), 2: S1 (28), 3: S2 (27), 4: S3 (25), 5: S4 (26), 6: S5 (35), 7: S6 (19), 8: S7 (32), 9: S8 (31), 10: S9 (30), 11: S10 (24), 12: S11 (22), 13: S12 (22), 14: S13 (24), 15: S14 (20), 16: S15 (27), 17: S16 (24), 18: S17 (22), 19: S18 (26), 20: S19 (18), 21: S20 (27), 22: S21 (25), 23: S22 (19), 24: S23 (21), 25: S24 (23).]
[Figure: The BioDCV system on the EGEE BioMed VO. Transfers of 2-50 MB and 50-400 MB between the local site and the grid, carried by grid-ftp and scp.]
Commands:
1. grid-url-copy/lcg-cp db from local to SE
2. edg-job-submit BioDCV.jdl
3. grid-url-copy/lcg-cp db from SE to local
[Figure: D2 mean spectra A and B, each with its .95 Student bootstrap CI. Intensity (0 to 4000) versus m/z (9100 to 9200); peak marked at 9133.17 Da.]
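A .95 confidence interval of the kind named in the D2 figure can be sketched as follows, assuming that "Student bootstrap CI" refers to the studentized (bootstrap-t) interval for the mean intensity at one m/z position. The poster does not give the procedure, so the function name, the number of resamples, and the ten intensity values are all illustrative assumptions.

```python
import random
from math import sqrt
from statistics import mean, stdev


def bootstrap_t_ci(values, level=0.95, n_boot=2000, seed=0):
    """Studentized bootstrap confidence interval for the mean."""
    rng = random.Random(seed)
    n = len(values)
    m = mean(values)
    se = stdev(values) / sqrt(n)
    t_stats = []
    for _ in range(n_boot):
        resample = [rng.choice(values) for _ in range(n)]
        s = stdev(resample)
        if s == 0:
            continue  # degenerate resample, studentized statistic undefined
        t_stats.append((mean(resample) - m) / (s / sqrt(n)))
    t_stats.sort()
    alpha = 1.0 - level
    t_low = t_stats[int((alpha / 2) * (len(t_stats) - 1))]
    t_high = t_stats[int((1 - alpha / 2) * (len(t_stats) - 1))]
    # The upper quantile sets the lower limit and vice versa.
    return m - t_high * se, m - t_low * se


# Hypothetical intensities of ten replicates at a single m/z position.
intensities = [3120.0, 2980.5, 3305.2, 3010.8, 2875.4,
               3190.1, 3050.0, 2940.7, 3225.9, 3102.3]
low, high = bootstrap_t_ci(intensities)
print(low, mean(intensities), high)
```

Applying the function independently at each m/z position of a group of replicates would give a pointwise band around the group mean spectrum.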