Development and Evaluation of a Prototype System for Automated Analysis of Clinical Mass
Spectrometry Data
Nafeh Fananapazir
Master’s Thesis Defense
February 27th, 2006
Outline
1. Background
   a. Mass spectrometry in context
   b. Overview of MS in the clinical domain
   c. Challenges in MS analysis
2. Development of an automated system (FAST-AIMS)
   a. Requirements for an automated system
   b. Overview of FAST-AIMS
   c. Technical description of FAST-AIMS
3. Evaluation of FAST-AIMS
   a. Evaluation one: multiple user study
   b. Evaluation two: multiple dataset study
4. Conclusions
1. Background
Mass spectrometry in context
• Analytical tool for measuring the molecular weight of sample components
• Used in the measurement of biological samples
   • lipids, complex carbohydrates, nucleic acids, peptides, proteins
• Resolution on the order of 0.01% of total molecular weight
   • Small samples: small organic molecules measured at ppm
   • Large samples: within 4 daltons of a 40,000 dalton sample
• Uses
   • Resolution: determination of sample purity, detection of amino acid substitutions, detection of post-translational modifications
   • Reaction monitoring: enzyme reactions, protein digestion
   • Sequencing: SEQUEST/SALSA
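The resolution figures above can be checked with a line of arithmetic. The sketch below is illustrative only; the function name `relative_error` is hypothetical and not part of the presentation.

```python
def relative_error(mass_da, error_da):
    """Return a mass error as (percent, parts per million) of the total mass."""
    percent = 100.0 * error_da / mass_da
    ppm = 1e6 * error_da / mass_da
    return percent, ppm


# The slide's large-sample example: within 4 daltons of a 40,000 dalton sample.
percent, ppm = relative_error(40_000, 4)
print(f"{percent:.2f}% of total mass, i.e. {ppm:.0f} ppm")
```

For the 40,000 dalton example this works out to 0.01% of total molecular weight, matching the stated order of resolution.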
Mass spectrometry in context: components of MS analysis
1. Sample isolation (2DE, biopsy, serum)
2. Sample processing (proteolytic digestion, LC)
3. Ionization (soft: molecules intact; hard: molecules fragment)
   a. ESI (electrospray ionization)
      • evaporation of charged, aerosolized droplets under vacuum
      • voltage can be raised, increasing fragmentation
   b. MALDI (matrix-assisted laser desorption/ionization)
      • laser excitation of a matrix-embedded sample
      • matrix: UV-absorbing; prevents excessive fragmentation
   c. SELDI (alternative to MALDI)
      • chromatographic separation based on hydrophobicity, cation/anion exchange, metal affinity
Mass spectrometry in context Components of MS analysis (continued)
4. Sample Analyzer (TOF, quadrupole, tandem)
5. Ion detection/recording (m/z vs. intensity)
6. Calibration (internal or external)
7. Data analysis
Mass spectrometry in context: spectra produced
• Mass/charge ratio (m/z) plotted against relative intensity
• 10^4 to 10^6 data points per spectrum
• [Figure: sample SELDI-TOF spectrum]
Overview of MS in the clinical domain
Tissue types
• Relatively non-invasive (e.g. blood serum, urine)
• Invasive (e.g. tissue biopsy, pancreatic juice)
Pathology types
• Newborn screening for metabolic diseases (e.g. PKU)
• Cancer (e.g. ovarian, prostate, pancreatic, lung)
• Non-cancer (e.g. hepatitis, cerebrovascular accidents)
Advantages of MS analysis
• Non-invasive
• Potential for early detection
• Early results appear promising
Overview of MS in the clinical domain
Brief historical considerations
• [February 2002] Petricoin et al. use SELDI-TOF spectra from blood serum samples to create a classification model for ovarian cancer
• [July 2003] Coombes et al. publish first publicly available pre-processing algorithms related to peak detection
• Over 15 primary studies related to disease classification using MS data have been published
Early generalizations
• Most proteome MS studies use MALDI-TOF or SELDI-TOF
• Most studies focus on blood serum proteome analysis
• Very few studies are associated with a publicly available dataset
Challenges in MS analysis
1. Sample collection
   • Consistency/reproducibility of sample collection/processing
   • Abundance of clinically relevant biomarkers for disease
      • Tumors produce very little biomarker [Diamandis]
      • Response: enzymatic amplification, “mopping” effect of larger molecules
2. Data analysis
   • Lack of disclosure of methods and algorithms employed
   • Overfitting
3. Interpretation of results
   • Determination of clinical relevance (performance metric)
   • Determination of biological relevance
      • Are biomarkers specific to disease of interest?
      • Example: generalized inflammatory response
      • Perhaps focus should be on studying tumor proteomes [Liebler]
Challenges in MS analysis
1. Microarray analysis of nucleotides
   • All spots may be known a priori
   • The array is the same (spots are “aligned”) from sample to sample
   • Intensity represents extent of hybridization with known oligonucleotides
   • Possible to limit analysis to known physiologic/pathologic pathways
2. Mass spectrometry analysis of peptides
   • Peptides represented by peaks are not known a priori
      • A peak may represent: noise, single peptide (known or unknown), peptide amalgamation
   • m/z values are not aligned from sample to sample
   • Peak alignment is not straightforward
   • Not possible to limit analysis to known physiologic/pathologic pathways
   • Spectra may represent tens to hundreds of thousands of data points
   • Lack of software performing complete analysis
2. Development of FAST-AIMS
Requirements for an automated system
• Goal 1: Complete analysis of unprocessed MS data that produces diagnostic or prognostic models and associated biomarkers
   • Goal is being met through the development and application of machine learning and biostatistics techniques by those with expertise.
• Goal 2: Creation of software that performs such analysis, allowing those without related expertise to have access to the information contained in clinical MS data.
   • No publicly available software system (commercial or free) exists providing complete, high-quality first-pass analysis of MS data.
Overview of FAST-AIMS
• FAST-AIMS
   • Fully Automated Software Tool for Artificial Intelligence in Mass Spectrometry
   • First software system to provide integrated analysis from raw data to model development
• Features
   • Mass spectrometry data preprocessing
   • Able to accommodate a range of user expertise
   • Avoids overfitting
   • Does not require additional software
   • Can perform three different tasks
      1. Generation of a classification model
      2. Estimation of classification performance on unseen data
      3. Application of a generated model to new data
Technical description of FAST-AIMS
Programming languages employed
• Wizard-like GUI: Delphi 7.0
• Data analysis algorithms: Matlab 6.5
• Integrated into a downloadable executable
Current dataset requirements:
• Each row represents one spectrum
• The first column contains class information for each spectrum
   • binary (0,1): non-cancer/cancer
   • ternary (0,1,2): healthy/benign/cancer
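As a rough illustration of the dataset layout described above (one spectrum per row, class label in the first column), the sketch below reads such rows into labels and intensity vectors. The comma-separated text format and the function name `load_spectra` are assumptions made for this example; the slides do not state the file format FAST-AIMS expects.

```python
import csv
import io


def load_spectra(text):
    """Parse rows of 'label,intensity,intensity,...' into two parallel lists.

    Assumes comma-separated text (a hypothetical format chosen for this sketch).
    """
    labels, spectra = [], []
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip blank lines
        labels.append(int(row[0]))                    # first column: class (0, 1 or 2)
        spectra.append([float(v) for v in row[1:]])   # remaining columns: intensities
    return labels, spectra


# Two made-up spectra with three intensity values each.
example = "0,1.5,2.0,0.3\n1,0.9,4.1,0.2\n"
labels, spectra = load_spectra(example)
print(labels, spectra)
```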
[Figure: example dataset layout split into TRAINING and TEST rows; class labels shown: Non-Cancer, Cancer, Unknown]
Current operation:
1. Data acquisition
2. Baseline subtraction
3. Peak detection
4. Peak alignment
5. Normalization
6. Feature selection
7. Build classification model
8. Apply model to new data
9. Evaluate classification
Technical description of FAST-AIMS
Model generation: experimental design
• Generation of classifier/feature selection permutations
• Extending the range of parameters for a given algorithm has a multiplicative effect on the number of permutations
• Selection of the following would yield 40 permutations:

Algorithm             Parameters
FS1 (All features)    (none)
FS2 (RFE)             cost {10, 100}; f {20, 40, 60}
FS3 (HITON)           α {0.3, 0.5, 0.7}
C1 (SVM)              cost {10, 100}; degree {1, 2}

(1 + 2×3 + 3) feature selection settings × (2×2) classifier settings = 10 × 4 = 40
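The count of 40 can be reproduced by enumerating the combinations. This is a minimal sketch, not FAST-AIMS code: the helper `grid` and the variable names are hypothetical, and the parameter names (`cost`, `f`, `alpha`, `degree`) are simply copied from the table above.

```python
from itertools import product


def grid(**params):
    """Expand {name: [values]} into a list of dicts, one per parameter combination."""
    names = list(params)
    return [dict(zip(names, combo)) for combo in product(*params.values())]


# Feature selection settings: 1 + (2 x 3) + 3 = 10
feature_selectors = (
    [("All features", {})]
    + [("RFE", p) for p in grid(cost=[10, 100], f=[20, 40, 60])]
    + [("HITON", p) for p in grid(alpha=[0.3, 0.5, 0.7])]
)

# Classifier settings: 2 x 2 = 4
classifiers = [("SVM", p) for p in grid(cost=[10, 100], degree=[1, 2])]

# Every feature selection setting is paired with every classifier setting.
permutations = list(product(feature_selectors, classifiers))
print(len(feature_selectors), len(classifiers), len(permutations))
```

Adding one more value to any parameter list multiplies, rather than adds to, the number of permutations for that algorithm, which is the effect the slide describes.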
Model generation: experimental design
Example: 5-fold cross-validation with 4 FS/classifier permutations (P1, P2, P3, P4) to choose from

Fold         P1     P2     P3     P4
1            0.8    0.6    0.8    0.6
2            0.8    0.6    0.7    0.7
3            0.9    0.5    0.9    0.7
4            0.7    0.6    0.9    0.7
5            0.6    0.7    1.0    0.6
Avg. perf.   0.76   0.60   0.86   0.66

Selected permutation: P3 (0.86)
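The selection step in this example amounts to averaging each permutation's score across the five folds and keeping the best. A short sketch using the scores from the slide (variable names are illustrative):

```python
# Rows are folds 1-5; columns are permutations P1-P4 (values from the slide).
fold_scores = [
    [0.8, 0.6, 0.8, 0.6],
    [0.8, 0.6, 0.7, 0.7],
    [0.9, 0.5, 0.9, 0.7],
    [0.7, 0.6, 0.9, 0.7],
    [0.6, 0.7, 1.0, 0.6],
]
names = ["P1", "P2", "P3", "P4"]

# zip(*fold_scores) transposes the table so each tuple holds one permutation's scores.
averages = {
    name: round(sum(column) / len(column), 2)
    for name, column in zip(names, zip(*fold_scores))
}
best = max(averages, key=averages.get)
print(averages, best)
```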
3. Evaluation of FAST-AIMS
Evaluation one: multiple user study
Elements of study protocol
• Users selected based on maximizing range of expertise
• Dataset selection
   • should minimize chance of prior exposure
   • should not have been used during FAST-AIMS development
   • analysis should be non-trivial
• All users were given an unrelated “practice” dataset
• FAST-AIMS users signed disclosure agreements and were given:
   • A general introduction to the study
   • A copy of FAST-AIMS
   • A FAST-AIMS tutorial document
• Users were to work independently towards submission of a classification model for application on a withheld testing set
Evaluation one: multiple user study
Dataset: Prostate cancer (n=162) [Banez 2003]
• Cancer (n=106), Non-cancer (n=56)
• Training (n=108), Testing (n=54)
Study participants designated as follows:
• FAST-AIMS users
   • Expert (n=4)
   • Non-expert (n=2)
• Non-FAST-AIMS users
   • Biostatistician (n=1)
4. Conclusions
Results
• Mass spectrometry data analysis is currently very difficult and requires expertise
• FAST-AIMS is the first system to show that this analysis can be fully automated and that it can be performed by non-expert users
• Two evaluations of FAST-AIMS were performed
   • Evaluation one:
      • Non-expert and expert users of FAST-AIMS can achieve performance that nearly matches that of expert biostatisticians in shorter time
      • All participants in the study achieved classification accuracy higher than previously published for the Banez dataset
   • Evaluation two:
      • FAST-AIMS achieved high classification performance when evaluating three different datasets
Future directions
Where FAST-AIMS is relatively weak
• Pre-processing techniques need further development
• Number of features (peaks) selected can still be very high; a small trade-off in accuracy may allow for selection of far fewer peaks
• Lack of reporting of statistical significance of each peak
• Error estimation methods could be strengthened
• Interface issues
   • Lack of visualization
   • Results are currently logged/reported in a text file
Future directions
Why FAST-AIMS is important
• Shows that analysis can be automated
• Accommodates
   • Naïve users: through a guided screen sequence and incorporation of defaults
   • Expert users: allows selection of a wide range of algorithms and associated parameters
• Allows all users to harness computing power in aid of faster analysis
• Even non-expert users can achieve classification performance greater than published results
Acknowledgements
With thanks extended to:
• Alexander Statnikov, Programmer, Study participant
• Yerbolat Dosbayev, Programmer, Study participant
• Kevin Maas, Study participant
• Yin Aphinyanaphongs, Study participant
• Yu Shyr, Ming Li (biostatistician component of evaluation one)
The presenter also wishes to thank his thesis committee:
• Constantin Aliferis, Primary thesis advisor
• Dean Billheimer, Department of Biostatistics
• Doug Hardin, Department of Mathematics
• Shawn Levy, Department of Biomedical Informatics
• Dan Liebler, Department of Biochemistry
• Ioannis Tsamardinos, Department of Biomedical Informatics
Addendum
Technical description of FAST-AIMS
Based on user-defined parameters, the following sequence of analysis is performed (all steps are logged):
1. Steps that can be performed on an individual spectrum independently of other spectra are performed on all samples
   • peak identification, baseline subtraction [Coombes 2003]
2. Data is partitioned based on the task(s) chosen.
   Generate model: The dataset is divided into n subsets. Samples are assigned to each subset randomly, while maintaining class distribution within each subset. Each subset forms the testing set for a partition. The remainder of the dataset becomes the training set for that partition.
   Estimate performance: The procedure for generating a model is followed. For each partition, each of the (n-1) subsets associated with each training set forms a train-test subset for a nested partition. The remainder of the training set becomes the train-train set for that nested partition.
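The partitioning in step 2 can be sketched as follows. This is an illustrative reading of the description, not the FAST-AIMS implementation (which the slides say was written in Matlab); all function names are hypothetical, and dealing shuffled samples round-robin is just one simple way to keep class proportions roughly equal across subsets.

```python
import random
from collections import defaultdict


def stratified_subsets(labels, n, seed=0):
    """Randomly split sample indices into n subsets, roughly preserving class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    subsets = [[] for _ in range(n)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal each class's shuffled samples round-robin across the subsets.
        for position, index in enumerate(indices):
            subsets[position % n].append(index)
    return subsets


def partitions(subsets):
    """Yield (training, testing) index lists; each subset is the testing set exactly once."""
    for i, testing in enumerate(subsets):
        training = [idx for j, s in enumerate(subsets) if j != i for idx in s]
        yield training, testing


def nested_partitions(subsets, i):
    """For outer partition i, yield (train_train, train_test) pairs from the other n-1 subsets."""
    inner = [s for j, s in enumerate(subsets) if j != i]
    yield from partitions(inner)


# Made-up labels: 6 non-cancer (0) and 12 cancer (1) samples, split three ways.
labels = [0] * 6 + [1] * 12
subsets = stratified_subsets(labels, 3)
outer = list(partitions(subsets))
inner_for_first = list(nested_partitions(subsets, 0))
print(len(outer), len(inner_for_first))
```

With n = 3, each outer partition has n-1 = 2 nested partitions, as the description states.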
3. Remaining pre-processing steps (requiring evaluation of multiple spectra) are performed within each training (or train-train) set
   • normalization sequence, peak alignment [Yasui 2003]
4. All permutations of feature selection and classification algorithms and associated parameters selected are determined.
5. Each permutation is used to generate a set of features (based on feature selection) and a classification performance as follows:
   Generate model: Feature selection is performed on each training set. Dimensionality of the dataset is reduced based on features selected. Classifier builds a model on reduced dataset. Model is applied to testing set and classification performance is recorded based on user-selected metric (ROC or accuracy).
   Estimate performance: Within each training set, the procedure for generating a model is followed (considering train-train sets and train-test sets).
6. Classification models are generated independent of classification of testing data:
   Generate model: A single classification model is generated by averaging the classification performance of each permutation on each testing set and choosing the permutation with best performance.
   Estimate performance: Within each training set, a classification model is generated by averaging the classification performance of each permutation on each train-test set and choosing the permutation with best performance. n models are thus generated.
7. Results are determined:
   Generate model: Result is the classification model determined (permutation and user-defined pre-processing steps) and associated average performance.
   Estimate performance: The optimized model for each training set is applied to the associated test set for each partition. A non-overfitted classification performance is determined for each partition. The average of these is reported as estimated performance.
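The "Estimate performance" path in steps 5 to 7 is a nested cross-validation: the permutation is chosen using only the train-train/train-test splits inside each training set, then scored once on that partition's untouched testing set. The sketch below shows that control flow under stated assumptions. The `score(permutation, training, testing)` callable is a hypothetical stand-in for "select features, build the classifier, apply it, and return the chosen metric", and the toy threshold data is invented for the demonstration.

```python
def mean(values):
    values = list(values)
    return sum(values) / len(values)


def folds_from(subsets):
    """Turn subsets into (training, held_out) pairs, holding out each subset once."""
    return [
        ([x for j, s in enumerate(subsets) if j != i for x in s], held_out)
        for i, held_out in enumerate(subsets)
    ]


def select_permutation(folds, permutations, score):
    """Return the permutation with the best average score over the given folds."""
    return max(permutations, key=lambda p: mean(score(p, tr, te) for tr, te in folds))


def estimate_performance(subsets, permutations, score):
    """Nested cross-validation: select on inner folds, evaluate on the outer testing set."""
    outer_scores = []
    for i, testing in enumerate(subsets):
        inner_subsets = [s for j, s in enumerate(subsets) if j != i]
        best = select_permutation(folds_from(inner_subsets), permutations, score)
        training = [x for s in inner_subsets for x in s]
        outer_scores.append(score(best, training, testing))
    return mean(outer_scores)


# Toy data (invented): one feature value and a 0/1 class label per sample.
samples = [(0.1, 0), (0.2, 0), (0.3, 0), (0.7, 1), (0.8, 1), (0.9, 1)]
subsets = [[0, 3], [1, 4], [2, 5]]
thresholds = [0.0, 0.5, 1.0]  # stand-ins for FS/classifier permutations


def threshold_accuracy(threshold, training, testing):
    # Toy scorer: nothing is fitted, so `training` is unused here.
    hits = sum((samples[i][0] > threshold) == bool(samples[i][1]) for i in testing)
    return hits / len(testing)


estimate = estimate_performance(subsets, thresholds, threshold_accuracy)
print(estimate)
```

Because the outer testing set never influences which permutation is chosen, the averaged outer score is the non-overfitted estimate the description refers to.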