mzMatch Excel Template - University of Strathclyde
mzMatch Excel Template
Tutorial
Installation & Requirements
• Installation
• The template may be used to process mzMatch output text files without
additional installations or add-ins.
• Microsoft Excel 2007 required (2003 not sufficient, 2010 not tested)
• Requirements for full function
• R Statistical Software: for mzMatch pre-processing
R packages: XCMS (BioC), mzMatch.R (Rforge), rJava and XML (CRAN)
R package: rCDK (for FormulaGenerator)
• Firefox or Internet Explorer: for hyperlinks to online databases
• Thermo Xcalibur: for EIC lookup
• ReAdW: for conversion of .RAW to .mzXML files
• If you wish to use R and Xcalibur links: Open the template and update cells
D44 and D45 (on the Settings sheet) to the relevant paths on your
computer
Data Pre-processing
• Step 1 - Setup
• Open “mzMatch_Template.xltm” and SaveAs “yourfile.xlsm”
(Macro enabled workbook)
• Go to the “Settings” sheet
• Update cells D44 and D45 (on the Settings sheet) to the relevant
paths on your computer
• Step 2 - Convert RAW files to centroided mzXML files
• Save a copy of ReAdW.exe into the folder with your RAW data
• Click ‘Convert RAW to mzXML files’ to run conversion
• Step 3
• If files are from Exactive, split into
Pos and Neg using the Blue button
• Step 4
• Select ‘positive’ or ‘negative’ mode in cell K1
(only process one mode at a time)
• For each polarity: sort replicate .mzXML files into folders
according to their experimental groups (sets).
• Check over the blue-shaded settings for mass, RT and Relatedpeaks
windows, and RSD filter. (xcms parameters can be changed in the macro)
• Run xcms/mzMatch with the purple ‘Combined Button’
NOTE: Files must be sorted into sets (folders) to run RSD filter
NOTE: If xcms crashes in negative mode, try selecting ‘mzData alt
method’ in cell K2
• mzMatch output files will be saved in the folder with your files
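The RSD filter mentioned above works on replicate intensities within each set. The following is a minimal sketch of the idea, not the mzMatch implementation: it assumes RSD is the sample standard deviation divided by the mean, and that a peak is kept when at least one set is reproducible. The function names and the keep rule are hypothetical.

```python
from statistics import mean, stdev


def rsd(intensities):
    """Relative standard deviation (sample SD / mean) of replicate intensities."""
    m = mean(intensities)
    if m == 0:
        return float("inf")
    return stdev(intensities) / m


def passes_rsd_filter(sets, max_rsd):
    """Assumed rule: keep a peak if any set with 2+ replicates has RSD <= max_rsd.

    `sets` maps a set (folder) name to the list of replicate intensities.
    """
    return any(len(v) > 1 and rsd(v) <= max_rsd for v in sets.values())


peak = {"Control": [90.0, 100.0, 110.0], "Treated": [10.0, 200.0, 400.0]}
print(round(rsd(peak["Control"]), 3))
print(passes_rsd_filter(peak, max_rsd=0.5))
```

This also shows why the files must be sorted into sets first: without the grouping there are no replicates to compare.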
Peak data import
• Step 1
• Import mzMatch output file “combined_related.txt” using the big
Red button (Settings sheet)
• Manually check that replicate samples are in adjacent columns
(if not, get cutting and pasting!)
• Step 2
• On the “Settings” sheet, enter the number of replicates in each
set (column F)
NOTE: if you have named samples with set prefixes, the next
Green button will do this for you
• Choose the Set-Type for each set using drop-down options in
column C
NOTE: hover mouse over cell C8 for more information
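The two checks in this section (replicates in adjacent columns, and the number of replicates per set) can be sketched outside Excel. This is an illustration only: it assumes sample names follow a hypothetical "SetName_replicate" convention, which may not match the prefix logic used by the Green button.

```python
from collections import Counter
from itertools import groupby


def set_prefix(sample_name, sep="_"):
    """Assumed convention: the set name is everything before the first separator."""
    return sample_name.split(sep, 1)[0]


def replicates_per_set(sample_names, sep="_"):
    """Count how many columns (replicates) belong to each set prefix."""
    return dict(Counter(set_prefix(n, sep) for n in sample_names))


def replicates_adjacent(sample_names, sep="_"):
    """True if every set occupies one contiguous run of columns."""
    runs = [key for key, _ in groupby(set_prefix(n, sep) for n in sample_names)]
    return len(runs) == len(set(runs))


columns = ["Ctrl_1", "Ctrl_2", "Drug_1", "Drug_2", "Drug_3"]
print(replicates_per_set(columns))
print(replicates_adjacent(columns))
```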
Update metabolite DB
• Step 1
• Externally, prepare a list of actual retention times for
authentic standards analysed under your current
chromatographic conditions. Any Excel-readable file with
name, RT and mass (optional) in columns can be directly
imported. ToxID is good for this.
NOTE: names must exactly match those in the DB (except
that “,” can be replaced by “_”).
• Step 2
• Select the ‘Rtcalculator’ sheet
• Enter the dead-volume time for your chromatographic
column (cell O9)
• Scroll to the right and manually update expected retention
times for given Pathways, Maps and Properties (if known)
• (optional) enter metabolite names and RTs for authentic
standards in columns A:B and W:X
NOTE: These can be entered automatically from an
external excel/tsv/csv file in step 3
• Step 3
• Run the ‘Update Retention Times in DB’ macro from either the
‘Settings’ or ‘Rtcalculator’ sheet
• If the prediction model looks good (i.e. r² > 0.6), agree to update
RTs in the DB; otherwise try altering the variables (cells E1:J1) to
suit your chromatography, and re-run the macro
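The r² > 0.6 rule of thumb can be reproduced by hand if you want to check a prediction model outside the template. The sketch below assumes r² means the squared Pearson correlation between observed (authentic standard) and predicted retention times; the template's own calculation may differ, and the function names are hypothetical.

```python
def r_squared(observed, predicted):
    """Squared Pearson correlation between observed and predicted retention times."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov * cov / (var_o * var_p)


def model_acceptable(observed, predicted, threshold=0.6):
    """Apply the r2 > 0.6 rule from the tutorial (threshold is adjustable)."""
    return r_squared(observed, predicted) > threshold


observed_rt = [4.1, 7.9, 12.3, 15.0]    # minutes, illustrative values only
predicted_rt = [4.5, 8.2, 11.6, 15.4]
print(model_acceptable(observed_rt, predicted_rt))
```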
• Step 4 (optional)
• If you have a species-specific database (e.g. from MetaCyc or
KEGG), enter these annotations in column G (“PreferredDB”) of
the DB sheet.
NOTE: This can be simplified by matching database identifiers
using Excel's ‘VLOOKUP’ function
• Select the entire database and Custom Sort: sort by ‘searchmass’
(ascending) then by ‘PreferredDB’ (ascending) to ensure
annotated metabolites are at the top of the list for each group of
isomers.
Run Metabolite Identification
• Step 1
• On the “Settings” sheet, check that the settings in columns F and I
are suitable
Most commonly changed settings are:
• Identification RT windows (F3 and F4) and mass window (F6)
• RT window for duplicate peaks (I9)
• MaxIntensity cutoff (I10)
• Select the adducts (cells K15:K21) that you wish to include in the
identification search
• Step 2
• Click ‘Run Identification Macro’ on the Settings sheet
• This could take from 2 to 20 minutes
• Save the file as soon as the macro is finished
Metabolite Identification: Process
• Metabolite Identification Macro
• This macro adds annotation to every peak in the ‘alldata’ sheet
• Upon completion, all basepeaks are copied to the ‘allBasePeaks’ sheet
• All identifications with confidence < 5 are copied to the ‘notlikely’ sheet
• All identifications with confidence ≥ 5 are copied to the ‘identification’ sheet
The ‘identification’ sheet is then checked for duplicates and shoulder peaks, and these are
moved to the ‘notlikely’ sheet
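The routing of peaks by confidence can be sketched as below. This only illustrates the threshold rule described above; the dictionary keys are hypothetical, and the later duplicate and shoulder-peak check is not reproduced because the tutorial does not describe its rules.

```python
def split_by_confidence(peaks, threshold=5):
    """Route peaks to 'identification' (confidence >= threshold) or 'notlikely'."""
    identification = [p for p in peaks if p["confidence"] >= threshold]
    notlikely = [p for p in peaks if p["confidence"] < threshold]
    return identification, notlikely


peaks = [
    {"name": "peak A", "confidence": 8},
    {"name": "peak B", "confidence": 4},
    {"name": "peak C", "confidence": 5},
]
identified, rejected = split_by_confidence(peaks)
print([p["name"] for p in identified], [p["name"] for p in rejected])
```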
• Peak Information columns
• A: neutral exact mass (from mzMatch)
• B: Retention Time (from mzMatch) in minutes
• C: Formula from DB with closest match to mass (if within ppm window)
• D: Number of isomers in DB with this exact formula
• E: Metabolite name: best match from DB for this mass and RT
• F: Confidence level according to parameters on ‘settings’ sheet
• G: Records whether the metabolite is in a ‘preferred database’ (from DB)
• H: Map: the general area of metabolism for this metabolite (usually from KEGG)
NOTE: column H can be changed by choosing a different header in cell H1
• I: mass error (in ppm) from nearest match in DB (if within 2 x ppm window)
• J: RT error relative to authentic standard (white) or predicted RT (grey) as % of RT
• K: altppm: mass error for the next closest mass in the DB (if within ppm window)
• L: Sig: records which sample sets are significant (peaks > blank and RSD < window)
• Peak Information cont.
• M: BP: Basepeak for that peak
• N: Mzdiff: mass difference between this peak and the basepeak
For basepeaks this column records common adducts/fragments/isotopes that were found
• O: relation.ship: relationship to the basepeak (according to mzMatch)
• P: addfrag: common adduct, fragment or neutral-loss
• Q: % error of C13-isotope intensity from theoretical
• R: % error of isotope intensity from theoretical for (Cl, S, N, O or H)
• S: RSD for QC samples (or for Treatment if no QC)
• T: minimum RSD for all included sample sets
• U: maximum intensity from all included sets
• V: Relation.id (from mzMatch)
• W: Peak intensity ratio for mean of ‘treatments’ vs mean of ‘controls’
• X: P-value for unpaired t-test between ‘treatments’ and ‘controls’
• Y: Adduct of formula match to mass (i.e. H, Na, double-charge, etc.)
• Z: Polarity
• AA: Number of detected peaks in included sets
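Several of the columns above (I and K) report mass error in ppm. A minimal sketch of that calculation follows, assuming the common convention of (observed minus theoretical) divided by theoretical; the sign convention and the function names are assumptions, not taken from the template.

```python
def ppm_error(observed_mass, theoretical_mass):
    """Mass error in parts per million relative to the theoretical mass."""
    return (observed_mass - theoretical_mass) / theoretical_mass * 1e6


def within_ppm_window(observed_mass, theoretical_mass, ppm_window):
    """True if the absolute ppm error is inside the chosen window."""
    return abs(ppm_error(observed_mass, theoretical_mass)) <= ppm_window


# Illustrative values only: a 0.0001 Da offset at mass 100 is about 1 ppm.
print(round(ppm_error(100.0001, 100.0), 2))
print(within_ppm_window(100.0001, 100.0, ppm_window=3))
```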
Re-calibrate mass accuracy
• Step 1
• On the “Settings” or ‘Identification’ sheet, click the “ppm check” button
• If the polynomial curve looks like a good fit, agree to re-calibrate
masses; otherwise, investigate the mass calibration manually
• Step 2
• Sort the ‘identification’ sheet by ppm error (use the blue ‘sort’ button)
• Remove metabolites with large errors (>1.5 ppm) by cut/paste to the
‘notlikely’ sheet
NOTE: easiest to manually annotate all mis-annotated peaks (in
column F), re-sort and move them all at once
NOTE: delete rows that have been removed (even if they appear
empty) to speed up processing
• Double-check the „altppm‟ column for alternate identifications before
you remove peaks
Manual Data Filtration
• Step 1 – recover false rejections
• Go to the ‘notlikely’ sheet and check for ‘false rejections’, particularly
those with a confidence of 4 (technical judgement required)
• Cut/paste false rejections onto the ‘identification’ sheet
• Step 2 – manual filtration
• On the ‘Identification’ sheet, check for ‘false positives’ and move them
to the ‘notlikely’ sheet by cut/paste, or with the ‘remove row’ button
• Press the ‘colouring’ button to make interpretation easier
• Press the ‘hyperlink’ button to activate weblinks
• Use the Sort functions, info-boxes, graphs and hyperlinks to assist
(columns B,D,K,L,W)
• Step 3 – manual identification
• On the ‘Identification’ sheet, check for duplicate identifications,
and choose alternative isomers where appropriate
• Manual Filtration: suggested process
1. Related Peaks (mass difference, neutral loss)
2. Retention Time limits (min, max, %error)
3. Adduct likelihood (2+ or Na+)
4. Isomers (split peaks, duplicate identifications)
5. Isotopic abundance (C13 isotope, other unique isotopes)
6. Peak shape (check chromatogram if codadw < 0.95)
7. Biological likelihood (related pathways, common contaminants)
Biological data analysis
• Step 1
• If you have Exactive pos/neg data, run the ‘Combine Pos/Neg’
function after processing each set individually
• Step 2
• Run the Intensity comparison macro from the ‘Identification’ sheet
or ‘settings’ sheet by clicking ‘Compare All Sets’. This calculates
mean and SD for each set and compares each set to the
designated ‘control’ group (relative intensity and t-test).
• Step 3
• In the ‘Comparison’ sheet, sort data by your column of interest:
• Relative intensity vs control
• P-value (t-test) vs control
• Metabolite Map or KEGG Pathway
• Use buttons at the top to plot graphs or export to motif/metexplore
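The set comparison (mean, SD, relative intensity vs control, and a t-test) can be approximated outside Excel as sketched below. The tutorial only says "unpaired t-test", so the Welch form of the t statistic used here is an assumption, and no p-value is computed. The function name and dictionary keys are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def compare_to_control(control, treatment):
    """Summarise one peak: per-set mean/SD, treatment/control ratio, Welch t statistic.

    Assumes a non-zero control mean and at least two replicates with
    non-zero variance in the data; no guarding is done in this sketch.
    """
    mean_c, mean_t = mean(control), mean(treatment)
    sd_c, sd_t = stdev(control), stdev(treatment)
    std_err = sqrt(sd_c ** 2 / len(control) + sd_t ** 2 / len(treatment))
    return {
        "control_mean": mean_c,
        "control_sd": sd_c,
        "treatment_mean": mean_t,
        "treatment_sd": sd_t,
        "ratio": mean_t / mean_c,
        "t_statistic": (mean_t - mean_c) / std_err,
    }


result = compare_to_control([100.0, 110.0, 90.0], [200.0, 220.0, 180.0])
print(result["ratio"], round(result["t_statistic"], 3))
```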
Multivariate analysis
• Step 1
• This template doesn't incorporate functionality for multivariate
analysis; use the light blue Export button to export either
‘allBasePeaks’ or ‘Identifications’ to MetaboAnalyst, or R/MATLAB/etc.
for further analysis
• Step 2
• If you wish to analyse all Basepeaks, run the ‘assign Basepeaks’
macro to help with annotation
• Step 3
• Unidentified masses can be investigated by clicking the empty
‘formula’ (C) cell – this will run FormulaGenerator in R
Other Features
• Additional Macros:
• Isotope Search: for untargeted metabolic labelling studies
• C13, N15 and O18 supported
• Combine Datasets: combines negative and positive data (from same column)
• Formula Generator: identifies formulae for unknown masses (uses rCDK)
• Checks validity of formulae against “Fiehn's Golden Rules”
• Additional Functions (Excel formulas):
• FormulaMatch – looks up a mass in the database
• ExactMass – calculates exact mass of a formula
• PPMcalc – calculates the mass error from a given mass or formula
• IsotopeAbundance – Calculates the theoretical isotopic abundance for a
given atom in a formula
• FormulaValid – checks formula validity against 5 Golden Rules
• AtomCount – returns the number of specified atoms in a formula
• Pos – calculates the positive charge at a given pH (given # cations & basic pKas)
• Neg – calculates the negative charge at a given pH (given # anions & acidic pKas)
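To give a feel for what functions such as AtomCount, ExactMass and Pos compute, here is a rough Python sketch. It is not the template's VBA code: the parser handles only simple formulae without brackets, the element table covers only C, H, N, O, P and S, and the charge calculation assumes the Henderson-Hasselbalch relationship. All function names are hypothetical.

```python
import re

# Monoisotopic masses (Da) for a small, illustrative set of elements.
MONOISOTOPIC = {
    "C": 12.0, "H": 1.0078250, "N": 14.0030740,
    "O": 15.9949146, "P": 30.9737620, "S": 31.9720707,
}


def parse_formula(formula):
    """Return {element: count} for a simple formula such as 'C6H12O6'."""
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    return counts


def atom_count(formula, element):
    """Number of atoms of one element in the formula (0 if absent)."""
    return parse_formula(formula).get(element, 0)


def exact_mass(formula):
    """Monoisotopic neutral mass; raises KeyError for unsupported elements."""
    return sum(MONOISOTOPIC[e] * n for e, n in parse_formula(formula).items())


def positive_charge(ph, n_cations, basic_pkas):
    """Permanent cations plus the protonated fraction of each basic group."""
    return n_cations + sum(1 / (1 + 10 ** (ph - pka)) for pka in basic_pkas)


print(atom_count("C6H12O6", "H"), round(exact_mass("C6H12O6"), 4))
print(positive_charge(7.0, n_cations=0, basic_pkas=[7.0]))
```

The negative-charge counterpart would use the mirror-image term 1 / (1 + 10 ** (pka - ph)) for each acidic group.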
FAQ
• WHERE TO START... which sheet?
All automated functions can be run from the Settings sheet.
After automated filtration and identification you can do manual curation on the ‘identification’ sheet, including the
mass re-calibration. Additional metabolites can be retrieved from the ‘notlikely’ (or ‘allBasePeaks’) sheet simply by
using the cut/paste functions in Excel; it is recommended to cut/paste whole rows rather than individual cells. The
easiest approach for meaningful biochemical analysis is to run the ‘Compare all’ function and sort the ‘Comparison’
sheet according to your interests. Additional columns (e.g. stats, normalised intensities, other information) can
always be added to the right of the existing data without affecting macro performance.
• POLARITY:
The polarity is automatically corrected by mzMatch.R during the peak picking process, and all masses that appear
in the Template are corrected neutral masses. Ensure that you set the correct 'polarity' option on the 'settings'
sheet before running anything. The polarity setting is also useful for combining positive and negative mode data,
and for the quicklink to Xcalibur Qual Browser EICs (i.e. whether to add or subtract a proton to get from neutral
mass back to m/z).
Note: Due to the automatic polarity correction by mzMatch, the masses of cations in the database have been
corrected by one proton (e.g. the mass of choline in the DB is 103, rather than the actual mass of 104).
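The proton correction described above can be written out explicitly. This sketch assumes singly charged [M+H]+ and [M-H]- ions and a proton mass of 1.007276 Da; the function name is hypothetical and other adducts are not handled.

```python
PROTON_MASS = 1.007276  # Da


def neutral_to_mz(neutral_mass, polarity):
    """Convert a corrected neutral mass back to the m/z seen in the raw data."""
    if polarity == "positive":
        return neutral_mass + PROTON_MASS
    if polarity == "negative":
        return neutral_mass - PROTON_MASS
    raise ValueError("polarity must be 'positive' or 'negative'")


# Choline is a permanent cation, so its DB mass (about 103.1) is one proton
# lighter than the m/z actually observed in positive mode (about 104.1).
print(round(neutral_to_mz(103.1002, "positive"), 4))
```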
• WHICH FILE TO USE FOR THE RETENTION TIME UPDATER?
You need to manually generate a list of retention times for authentic standards under the current LC conditions.
The simplest way is to use ToxID (or similar); otherwise do it manually from raw data.
The retention time updater has been tested on ToxID .csv output files. However, it should work for any
Excel-readable file that has a column for metabolite names and a column for retention times. (Note: the metabolite name
must be identical to the name in the database; the only exception is that underscore "_" may be used in place
of comma "," to avoid issues with .csv files.)
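The name-matching rule can be sketched as follows, which may help when preparing a standards file. It assumes matching is exact and case-sensitive, with a single fallback that swaps every underscore for a comma; the function name and example names are hypothetical, and the updater's real behaviour may differ.

```python
def match_db_name(file_name, db_names):
    """Return the matching DB name, or None if the name would not be recognised."""
    if file_name in db_names:
        return file_name
    candidate = file_name.replace("_", ",")
    return candidate if candidate in db_names else None


db = {"2,3-Bisphosphoglycerate", "L-Alanine"}
print(match_db_name("2_3-Bisphosphoglycerate", db))
print(match_db_name("l-alanine", db))
```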
• IF IT RUNS SLOWLY?
The peak-picking process in XCMS is quite slow; it can be left to run overnight if you have many samples. The
speed of mzMatch.R functions and Excel macros will depend on the number of samples, number of detected
peaks, and your computer speed. Speed can be improved by applying tighter filters earlier in the process (e.g.
peak picking parameters and RSD filter); however, this may cause loss of some peaks of interest.
Visualisation of results in Excel can be slow if there are many active formulas. Try turning automatic calculation off,
de-activating hyperlinks, or running the ‘Trim file size’ macro.
Any Further Questions/Ideas
mzMatch information available at:
Mzmatch.sourceforge.net
Xcms information available at:
metlin.scripps.edu/xcms/
Information about this mzMatch template available directly from:
Dr Darren Creek
University of Glasgow