Prediction and Simulation of Mass Spectra - Metabolomics Fiehn Lab

17
1 Welcome! Mass Spectrometry meets Cheminformatics Tobias Kind and Julie Leary UC Davis Course 9: Prediction and simulation of mass spectra Class website: CHE 241 - Spring 2008 - CRN 16583 Slides: http://fiehnlab.ucdavis.edu/staff/kind/Teaching/ PPT is hyperlinked – please change to Slide Show Mode

Transcript of Prediction and Simulation of Mass Spectra - Metabolomics Fiehn Lab

Page 1: Prediction and Simulation of Mass Spectra - Metabolomics Fiehn Lab

1

Welcome!

Mass Spectrometry meets CheminformaticsTobias Kind and Julie Leary

UC Davis

Course 9: Prediction and simulation of mass spectra

Class website: CHE 241 - Spring 2008 - CRN 16583Slides: http://fiehnlab.ucdavis.edu/staff/kind/Teaching/PPT is hyperlinked – please change to Slide Show Mode

Page 2: Prediction and Simulation of Mass Spectra - Metabolomics Fiehn Lab

2

History of artificial intelligence and mass spectrometry

Dendral project at Stanford University (USA)Started in 1960sPioneered approaches in artificial intelligence (AI)

Aim: Prediction of isomer structures from mass spectraIdea: Self-learning or intelligent algorithm

Participants:Lederberg, Sutherland, Buchanan, Feigenbaum,Duffield, Djerassi, Smith, Rindfleisch, many others…

[Dendral PDF]Figure: Heuristic DENDRAL: A Program for Generating Explanatory Hypotheses in Organic Chemistry

Page 3: Prediction and Simulation of Mass Spectra - Metabolomics Fiehn Lab

3

Prediction and simulation of mass spectra

A) Prediction of the isomer structure or substructures from a given mass spectrumThe structure is directly deduced from the mass spectrum or generated by a molecular isomer generator or existing structures can be found in a structure database

B) Simulation of a mass spectrum from a given isomer structureThe mass spectral peaks and abundances are generated by a machine learning algorithmThe structures can be obtained from a isomer database (PubChem, LipidMaps)or a sequence database (Swiss-Prot, NCBI) in case of proteins

(mainlib) Coronene40 60 80 100 120 140 160 180 200 220 240 260 280 300

0

50

100

100 122 136

150

168 222 246 268

300

(mainlib) Coronene40 60 80 100 120 140 160 180 200 220 240 260 280 300

0

50

100

100 122 136

150

168 222 246 268

300

Page 4: Prediction and Simulation of Mass Spectra - Metabolomics Fiehn Lab

4

Application of machine learning for detection of substructures from mass spectra

Data Preparation

Feature Selection

Model Training +

Cross Validation

Model Testing

Basic Statistics, Remove extreme outliers, transform or normalize datasets, mark sets with zero variances

Predict important features with MARS, PLS, NN, SVM, GDA, GA; apply voting or meta-learning

Use only important features, apply bootstrapping if only few datasets;

Use GDA, CART, CHAID, MARS, NN, SVM, Naive Bayes, kNN for prediction

Calculate Performance with Percent disagreement and Chi-square statistics

Model DeploymentDeploy model for unknown data;

use PMML, VB, C++, JAVA

What is machine learning?

Page 5: Prediction and Simulation of Mass Spectra - Metabolomics Fiehn Lab

5

Prediction of substructures from mass spectra

Picture source: amdis.net

Working examples for EI mass spectra:Varmuza classifiers in AMDIS and MOLGEN-MS

Substructure algorithm (Stein S.E.) Implemented in NIST-MS search program

Mass spectral classifiers for supporting systematic structure elucidationVarmuza K., Werther W., J. Chem. Inf. Comput. Sci., 36, 323-333 (1996). Chemical Substructure Identification by Mass Spectral Library SearchingS.E. Stein, J. Am. Soc. Mass Spectrom., 1995, 6, (644-655)

Page 6: Prediction and Simulation of Mass Spectra - Metabolomics Fiehn Lab

6

Substructures deduced from mass spectra for generation of isomer structures

Picture source: amdis.net

1) Molecular formula must be known - can be detected from molecular ion and isotopic pattern 2) Good-list (substructure exists) and bad-list (substructure not existent) approach 3) Sub-structures are combined in deterministic or stochastic (random) manner4) Database or molecular isomer generator (combinatorial, graph theory) approach for

generating or finding possible structure candidates

Example: Molecular formula C6ClH5O; calculated from molecular ion

Goodlist:

Badlist:

Database (Chemspider): 25 hits(including all possible existing structures)

MOLGEN Demo: All constructed isomers: 8372

-benzene-hydroxy-chlorine

Total: 3 possible results

Page 7: Prediction and Simulation of Mass Spectra - Metabolomics Fiehn Lab

7

Simulation of mass spectra

Why is simulation of mass spectral fragmentation important?

Imagine – you have a structure database of all moleculesImagine – you can simulate mass spectra for all these moleculesImagine – you can match your experimental spectra against a database of calculated spectra

Isomer DBGenerate mass spectra

with Machine Learning Algorithm

(mainlib) D(+)-Talose10 30 50 70 90 110 130 150 170 190

0

50

100

15

31

43

6073

91 101 119131 144

10 30 50 70 90 110 130 150 170 1900

50

100

31 43

60

73

91 101

10 30 50 70 90 110 130 150 170 1900

50

100

31 43

60

73

91 101

10 30 50 70 90 110 130 150 170 1900

50

10031

43

60

73

119

131 144

10 30 50 70 90 110 130 150 170 1900

50

10031

43

60

73

119

131 144

MS DBof theoretical spectra

10 30 50 70 90 110 130 150 170 1900

50

100

15

31

43

60

73

91 101

10 30 50 70 90 110 130 150 170 1900

50

100

15

31

43

60

73

91 101

Experimental mass spectrum

Compare MS(calc) vs. MS(exp)

If the calculation is simple the database is not needed;In-silico MS fragments can be calculated on-the-fly

Page 8: Prediction and Simulation of Mass Spectra - Metabolomics Fiehn Lab

8

Simulation of alkane mass spectra (I)

ApproachUse of artificial neural networks (ANN) (machine learning)Electron impact spectra 70 eVSubstructure descriptors were used for calculationSelection of 44 m/z positions – training was performed for correct intensity

117 noncyclic alkanes and 145 noncyclic alkenes training set: 236 molecules prediction set: 26 compounds

ProblemsPrediction or validation set very small (should be 30%)Prediction of molecular ion (usually very low abundant)Overfitting possible, works only for selected substance classes

Source: WIKI

Source: Jalali-Heravi M. and Fatemi M. H.; Simulation of mass spectra of noncyclic alkanes and alkenes using artificial neural network

Page 9: Prediction and Simulation of Mass Spectra - Metabolomics Fiehn Lab

9

Simulation of alkane mass spectra (II)

Source: Jalali-Heravi M. and Fatemi M. H.; Simulation of mass spectra of noncyclic alkanes and alkenes using artificial neural networkAnalytica Chimica Acta; Elsevier permission use for coursepack/classroom material

2,3,3-trimethylpentane (a and b) and 2,3,4-trimethylpentane (c and d).OKVWYBALHQFVFP-UHFFFAOYAT RLPGDEORIPLBNF-UHFFFAOYAR

Structures: Chemspider

Page 10: Prediction and Simulation of Mass Spectra - Metabolomics Fiehn Lab

10

Simulation of lipid tandem mass spectra (I)

Picture: Thanks to Yetukuri et al. BMC Systems Biology 2007 1:12 doi:10.1186/1752-0509-1-12

Single examples

Similar structures; plus CH2 in side chains sn1 and sn2; double bonds possibleSimilar and almost constant fragmentation rulesLoss of head group (diagnostic ion in MS and MS/MS spectrum)Loss of rest one (R1) and rest two (R2) can be observed in MS/MS spectrum

Page 11: Prediction and Simulation of Mass Spectra - Metabolomics Fiehn Lab

11

Simulation of lipid tandem mass spectra (II)

Spectrum Source:Lipidmaps.org

C45H82NO8PGPCho269.2481303.2324526.3297544.3403492.3453510.355920:4(5Z,8Z,11Z,14Z)/17:0437796.5856C45H82NO8PGPCho303.2324269.2481492.3453510.3559526.3297544.340317:0/20:4(5Z,8Z,11Z,14Z)437796.5856C43H74NO10PGPSer269.2481301.2168526.2569544.2675494.2882512.298820:5(5Z,8Z,11Z,14Z,17Z)/17:0537796.5128C43H74NO10PGPSer301.2168269.2481494.2882512.2988526.2569544.267517:0/20:5(5Z,8Z,11Z,14Z,17Z)537796.5128C40H77O13PGPIns227.2011269.2481569.309587.3196527.2621545.272717:0/14:0031797.5180C40H77O13PGPIns269.2481227.2011527.2621545.2727569.309587.319614:0/17:0031797.5180FormulaHGsn2 acid(-)sn1 acid(-)M-sn2-H2O+HM-sn2+HM-sn1-H2O+HM-sn1+HAbbrev.DBCMass

C45H82NO8PGPCho269.2481303.2324526.3297544.3403492.3453510.355920:4(5Z,8Z,11Z,14Z)/17:0437796.5856C45H82NO8PGPCho303.2324269.2481492.3453510.3559526.3297544.340317:0/20:4(5Z,8Z,11Z,14Z)437796.5856C43H74NO10PGPSer269.2481301.2168526.2569544.2675494.2882512.298820:5(5Z,8Z,11Z,14Z,17Z)/17:0537796.5128C43H74NO10PGPSer301.2168269.2481494.2882512.2988526.2569544.267517:0/20:5(5Z,8Z,11Z,14Z,17Z)537796.5128C40H77O13PGPIns227.2011269.2481569.309587.3196527.2621545.272717:0/14:0031797.5180C40H77O13PGPIns269.2481227.2011527.2621545.2727569.309587.319614:0/17:0031797.5180FormulaHGsn2 acid(-)sn1 acid(-)M-sn2-H2O+HM-sn2+HM-sn1-H2O+HM-sn1+HAbbrev.DBCMass

ExperimentalMass spectrum

In-silico predictionof MS/MS mass spectral fragments

Simulation of tandem mass spectraor MS/MS fragment data fromLipidMaps

Page 12: Prediction and Simulation of Mass Spectra - Metabolomics Fiehn Lab

12

Simulation or prediction of oligosaccharide spectra(carbohydrate sequencing)

See Oscar and FragLibSee GlySpy

Source: Congruent Strategies for Carbohydrate Sequencing. 3. OSCAR: An Algorithm for Assigning Oligosaccharide Topology from MSn Datahttp://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1435829

Consistent building blocks (sugars)Consistent fragmentation allows in-silico fragment predictionPre-calculated fragments from known structures can be stored in database (use NIST-MS-Search)Algorithm works also on-the-fly without databaseDe-novo algorithms work for truly unknown structures

Page 13: Prediction and Simulation of Mass Spectra - Metabolomics Fiehn Lab

13

Simulation of peptide fragmentations(De-novo sequencing of peptides)

Principle:De-novo sequencing of peptides (determine amino acid sequences)De-novo algorithms can perform permutations and combinatorial calculations from all 20 amino acids (superior if the sequence is not found in a database)Highly dependent on good mass accuracy (less than 1 ppm) of precursor ion and MS/MS fragments Generate match score by matching in-silico fragments against experimental MS/MS spectrum

Problems: Leucine and isoleucine have same massPost translational modifications (PMTs)Missing fragment peaks

Picture source: MWTWIN help file2 (Monroe/PNNL)Picture 2 source: Tandem mass spectrometry data quality assessment by self-convolutionKeng Wah Choo and Wai Mun Tham http://www.biomedcentral.com/1471-2105/8/352

Page 14: Prediction and Simulation of Mass Spectra - Metabolomics Fiehn Lab

14

The Last Page - What is important to remember:

Fragmentation and rearrangement rules and ion physics can be programmed into algorithmsAbundance calculations are problematic

Prediction of isomer substructures from mass spectra is possibleWorks for reproducible mass spectra

A simplified simulation of mass spectra and simulation of fragmentation patternis only possible for certain molecule classes

Works only for peptides, lipids, oligosaccharides, alkanesDoes not work for all other molecules Does not work with complex (side chain) modifications

Machine Learning Methods for simulation and prediction of mass spectrarequire a large pool of diverse experimental mass spectra and MSn spectra for training

Page 15: Prediction and Simulation of Mass Spectra - Metabolomics Fiehn Lab

15

Tasks (42 min):

Download one of the following tools:MOLGEN, MOLGEN-MS, AMDIS, OMMSA, OSCAR or any free/commercial/demoprogram for in-silico peptide fragment determination or de-novo sequencing.Report on use.

Page 16: Prediction and Simulation of Mass Spectra - Metabolomics Fiehn Lab

16

Literature (36 min):

Mathematical tools in analytical mass spectrometry [DOI]Metabolomics, modelling and machine learning in systems biology – towards an understanding of the languages of cells [DOI]Heuristic DENDRAL: A Program for Generating Explanatory Hypotheses in Organic Chemistry [PDF]Mass Analysis Peptide Sequence Prediction [LINK]

Page 17: Prediction and Simulation of Mass Spectra - Metabolomics Fiehn Lab

17

Links:

Used for research: (right click – open hyperlink)http://scholar.google.com/scholar?hl=en&q=%22Simulation+of+mass+spectrahttp://scholar.google.com/scholar?num=100&hl=en&lr=&safe=off&q=+Simulation+of+%22mass+spectral+fragmentationhttp://www.google.com/search?num=100&hl=en&safe=off&q=in-silico+prediction+tandem+mass+spectra&btnG=Searchhttp://www.aseanbiotechnology.info/Abstract/21020883.pdfhttp://www.google.com/search?hl=en&q=GNU+polyxmass%2C&btnG=Google+Searchhttp://www.google.com/search?hl=en&q=C41H76N2O15&btnG=Google+Searchhttp://www.google.com/search?num=100&hl=en&safe=off&q=MOLGEN+MS&btnG=Searchhttp://www.google.com/search?hl=en&q=G.+L.+Sutherland&btnG=Google+Search

GlySpy and the Oligosaccharide Subtree Constraint Algorithm (OSCAR)See Mass Frontier for further discussionMOLGEN-MS [LINK]

Of general importance for this course:http://fiehnlab.ucdavis.edu/staff/kind/Metabolomics/Structure_Elucidation/