Clusterwise Peak Detection and Filtering Based on Spatial … · 2019. 9. 30. · Clusterwise Peak...

University of Groningen Clusterwise Peak Detection and Filtering Based on Spatial Distribution To Efficiently Mine Mass Spectrometry Imaging Data Eriksson, Jonatan O.; Rezeli, Melinda; Hefner, Max; Marko-Varga, György; Horvatovich, Péter Published in: Analytical Chemistry DOI: 10.1021/acs.analchem.9b02637 IMPORTANT NOTE: You are advised to consult the publisher's version (publisher's PDF) if you wish to cite from it. Please check the document version below. Document Version Publisher's PDF, also known as Version of record Publication date: 2019 Link to publication in University of Groningen/UMCG research database Citation for published version (APA): Eriksson, J. O., Rezeli, M., Hefner, M., Marko-Varga, G., & Horvatovich, P. (2019). Clusterwise Peak Detection and Filtering Based on Spatial Distribution To Efficiently Mine Mass Spectrometry Imaging Data. Analytical Chemistry, 91(18), 11888-11896. https://doi.org/10.1021/acs.analchem.9b02637 Copyright Other than for strictly personal use, it is not permitted to download or to forward/distribute the text or part of it without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license (like Creative Commons). Take-down policy If you believe that this document breaches copyright please contact us providing details, and we will remove access to the work immediately and investigate your claim. Downloaded from the University of Groningen/UMCG research database (Pure): http://www.rug.nl/research/portal. For technical reasons the number of authors shown on this cover page is limited to 10 maximum. Download date: 30-12-2020


Clusterwise Peak Detection and Filtering Based on Spatial Distribution To Efficiently Mine Mass Spectrometry Imaging Data

Jonatan O. Eriksson,† Melinda Rezeli,† Max Hefner,† György Marko-Varga,† and Péter Horvatovich*,‡,†

†Lund University, Department of Biomedical Engineering, Lund, Sweden
‡University of Groningen, Department of Analytical Biochemistry, Groningen Research Institute of Pharmacy, Antonius Deusinglaan 1, 9713 AV Groningen, The Netherlands

*S Supporting Information

ABSTRACT: Mass spectrometry imaging (MSI) has the potential to reveal the localization of thousands of biomolecules such as metabolites and lipids in tissue sections. The increase in both mass and spatial resolution of today’s instruments brings on considerable challenges in terms of data processing; accurately extracting meaningful signals from the large data sets generated by MSI without losing information that could be clinically relevant is one of the most fundamental tasks of analysis software. Ion images of the biomolecules are generated by visualizing their intensities in 2-D space using mass spectra collected across the tissue section. The intensities are often calculated by summing each compound’s signal between predefined sets of borders (bins) in the m/z dimension. This approach, however, can result in mixed signals from different compounds in the same bin or splitting the signal from one compound between two adjacent bins, leading to low quality ion images. To remedy this problem, we propose a novel data processing approach. Our approach consists of a sensitive peak detection method able to discover both faint and localized signals by utilizing clusterwise kernel density estimates (KDEs) of peak distributions. We show that our method can recall more ground-truth molecules, molecule fragments, and isotopes than existing methods based on binning. Furthermore, it automatically detects previously reported molecular ions of lipids, including those close in m/z, in an experimental data set.

Received: June 9, 2019
Accepted: August 12, 2019
Published: August 12, 2019

Cite This: Anal. Chem. 2019, 91, 11888−11896 (pubs.acs.org/ac)

© 2019 American Chemical Society. DOI: 10.1021/acs.analchem.9b02637

This is an open access article published under a Creative Commons Non-Commercial No Derivative Works (CC-BY-NC-ND) Attribution License, which permits copying and redistribution of the article, and creation of adaptations, all for non-commercial purposes.

Mass spectrometry imaging (MSI) is a technique often used to study the localization of known and unknown biomolecules such as lipids, metabolites, or peptides in tissue. Today’s instruments can scan samples with both high spatial and mass spectral resolution and, consequently, generate massive data sets that require highly efficient and accurate processing. Thus, one of the key components of MSI data processing is data-reduction, which typically involves detection and extraction of signals originating from tissue or drug compounds while discarding noise.1,2 The peaks of each spectrum are mapped onto a common reference, and by visualizing the intensities of individual peaks as images the spatial distribution of biomolecules can be revealed. The reference spectrum is generated by detecting peaks which are common to multiple spectra. Accurate peak detection facilitates the isolation of signals from individual compounds, which is necessary to obtain high quality images.

Many existing MSI software packages, such as Cardinal3 and MALDIquant,4 detect isotopic peaks of compounds on a data set mean spectrum and subsequently rank them based on the frequency of their presence in ion image pixels. This method is fast and produces concise peak lists but has limited performance for low-intensity peaks and those localized to small regions in the analyzed tissue section.1 Many tools generate ion images by binning around each peak of interest; the intensity value for each pixel is calculated by summing ion intensities between predefined m/z borders (bins). When doing this, however, it is crucial to use narrow bins to avoid mixing signals from multiple compounds in one image and to ensure that the mass of the peak around which binning is performed is accurate.

Suits et al.5 showed that slicing the entire m/z range into ion images of fixed mass widths enables MSI practitioners to explore MSI data sets in a hypothesis-free manner. This approach sets no threshold on either peak intensity or presence in a minimum number of pixels and is thus not biased toward large or high intensity molecules in the tissue. Choosing bin width is a specificity-sensitivity trade off. A small bin width results in higher sensitivity but increases the risk of peak splitting and a higher number of empty or noninformative ion images. Larger bin widths on the other hand result in fewer noninformative images but are unable to discriminate between compounds that are close in mass, resulting in ion images containing signals from multiple compounds. Unfortunately, even when using relatively large bin widths, slicing leads to impractically large sets of ion-images unless the experimentalist is guided by known ion masses. However, previous studies have demonstrated that incorporating information about the ion-images’ spatial structure in MSI data analysis pipelines is an effective way to automatically separate high and low quality images in these large image sets.6−9
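The binning and slicing approach described in this introduction can be sketched in a few lines of Python. This is an illustration only, not the implementation of any of the cited tools. It assumes that each pixel's centroided spectrum is stored as a list of (m/z, intensity) pairs; the function names, the dictionary layout, and the toy intensities are hypothetical. The 0.05 Da width and the 150−1000 m/z range are the values used for slicing in the Results section.

```python
import math

def n_slices(mz_lo, mz_hi, width):
    """Number of fixed-width m/z slices needed to cover [mz_lo, mz_hi)."""
    return round((mz_hi - mz_lo) / width)

def bin_index(mz, mz_lo, width):
    """Index of the fixed-width slice that contains mz."""
    return math.floor((mz - mz_lo) / width)

def binned_ion_image(spectra, lo, hi):
    """For every pixel, sum the intensities of all peaks with lo <= m/z < hi."""
    return {
        pixel: sum(intensity for mz, intensity in peaks if lo <= mz < hi)
        for pixel, peaks in spectra.items()
    }

# Toy data with hypothetical intensities: two pixels, keyed by (x, y).
spectra = {
    (0, 0): [(394.161, 10.0), (394.176, 5.0)],
    (0, 1): [(394.166, 2.0), (500.000, 7.0)],
}
image = binned_ion_image(spectra, 394.15, 394.20)
```

In this toy example the peaks at 394.161 and 394.176 m/z fall into the same 0.05 Da slice, so their intensities are summed into one pixel value, which is the signal mixing that narrow bins are meant to avoid.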

In this paper, we present a peak detection method that enables automatic detection of faint and localized signals as well as high intensity and/or abundant signals. We show that our peak detection can serve as a part of an MSI data analysis pipeline that is both sensitive and specific by combining it with established methods that filter peaks based on their spatial arrangement. A sensitive peak detection algorithm is not only essential for exploratory analysis but also for discovering molecules spatially colocalized with those expected to be present, e.g., drug compounds and metabolites. This is highly relevant in both scientific and clinical settings where drug−tissue interaction and tissue composition are often investigated. To assess and compare the performance of our method to existing MSI data processing tools, we used a rat liver section spiked with several drugs, most of which are anticancer drugs, where the masses of the spiked drugs are used as ground-truth. Using this data set, we show that we are able to detect drug peaks as well as fragment and isotopic peaks, including those that are close in m/z to more intensive and/or abundant peaks. We also used the MSI data set from a mouse bladder section originally presented by Römpp et al.10 to further assess our method.

■ MATERIALS AND METHODS

Drug Compounds and Matrix Composition. For the MALDI-MSI experiment, we selected 12 different drugs (see chart in Supporting Information). The drugs were purchased from the LC Laboratories (Woburn, MA; CAS numbers: dabrafenib: 1195765-45-7, dasatinib: 302962-49-8, erlotinib: 183321-74-6, gefitinib: 184475-35-2, imatinib: 152459-95-5, lapatinib: 388082-78-8, pazopanib: 444731-52-6, sorafenib: 284461-73-0, sunitinib: 557795-19-4, trametinib: 871700-17-3, vatalanib: 212141-54-3) and from SelleckChem (Munich, Germany; CAS numbers: ipratropium: 60205-81-4) with >99% purity and were dissolved in methanol (MeOH, Chromasolv Plus for HPLC; Sigma-Aldrich, Steinheim, Germany) at 10 mg/mL concentration. These stock solutions were further diluted with 50% MeOH and five mixtures were generated, each containing four different drug compounds. The spreadsheet in Supporting Information summarizes the composition of the five drug mixtures. A 5 mg/mL solution of α-cyano-4-hydroxycinnamic acid (CHCA, Sigma-Aldrich) dissolved in 50% MeOH containing 0.1% trifluoroacetic acid (TFA, Sigma-Aldrich, Steinheim, Germany) was used as matrix solution.

Sample Preparation. For MALDI-MSI, a 10 μm section was cut from frozen rat liver tissue using a cryotome and placed on a glass slide. Then 0.3 μL from each drug mixture was pipetted on the tissue section at predefined positions. After drying of the tissue, CHCA matrix solution was deposited on the tissue surface by an automated pneumatic sprayer (TM-Sprayer, HTX Technologies). The nozzle distance was 46 mm, and the spraying temperature was set to 35 °C; the matrix was sprayed (19 passes) over the tissue section at a linear velocity of 750 mm/min with a flow rate set to 0.1 mL/min and a nitrogen pressure set at 10 psi. After each pass, a drying time of 30 s was set on the spraying machine to give time for the sample to dry completely before the next pass. The frozen rat liver tissue was provided by Prof. Roland Andersson (Dept. Clinical Sciences Lund (Surgery), Skåne University Hospital, Lund University). Animals were housed and bred according to regulations for the protection of laboratory animals.

MALDI MSI. MSI data was collected by sampling the tissue section with 50 μm raster arrays without laser movement within each measuring position. The dimensions of the measured liver tissue section were approximately 0.9 by 1.2 cm in x, y sampling coordinates. A total of 23 823 sampling positions (x = 247, y = 181) were collected. Full mass spectra were collected using a MALDI LTQ Orbitrap XL mass spectrometer (Thermo Fisher Scientific, Bremen, Germany), equipped with a 60 Hz 337 nm nitrogen pulse laser (LTB Lasertechnik Berlin, Berlin, Germany). This instrument was operated at 60 000 resolution (at m/z 400) collecting spectral data in the mass range of 150−1000 m/z in profile mode generated by 20 laser shots at 10 μJ with automatic gain control switched off. Data were acquired using Xcalibur v 2.0.7 software (Thermo Fisher Scientific, San Jose, CA). The MSI raw data contains mass spectra from all measurement points together with their x, y coordinates.

The Thermo Scientific raw files were first converted to mzML using msconvert and then to imzML11 format using imzmlConverter. Finally, the imzML data was loaded into MATLAB and analyzed with custom scripts. The mouse bladder data set with PXD001283 ID was downloaded from ProteomeXchange in imzML format.

Peak Picking. We propose a two-step peak picking scheme: in the first step, candidate peaks are detected on clusters of peak m/z values from all spectra, and in the second, the spatial distribution of the candidate peaks is evaluated and we select those that display a coherent structure. For the first step, we have devised a novel method that relies on clusterwise kernel density estimates (KDEs) of spectral peaks. KDEs are smooth histograms and we use them to estimate the distribution of the peak m/z values within clusters along the m/z axis. The level of smoothness is adapted to each cluster independently. Candidates of data set peaks are then detected as local maxima on the resulting KDE curves. For the second step, we use two established ways to automatically estimate the quality of the images corresponding to peaks obtained in the first step as a means to filter out noninformative peaks. Figure 1 summarizes all parts of our peak picking scheme.

Figure 1. Flowchart of our peak picking algorithm. m/z values of peaks from each individual spectrum are collected and sorted in mzall. We then identify clusters in mzall as connected components in a directional graph. For each cluster we fit an optimized KDE to the distribution of m/z values. Data set peaks are obtained as local maxima on the resulting KDE curve. Finally, the level of structure in the ion images corresponding to the data set peaks is estimated and used to filter out noise peaks. The peak corresponding to the center ion image, at m/z = 494.2505, is an example of one filtered out in the last step.

Peak Detection. First, we collect the peak masses from every spectrum in one list, mzall, which is then sorted in ascending order. Centroided spectra are taken as input and peaks with heights below a very low intensity threshold are discarded to reduce the impact of background noise. Consequently, mzall will contain most peak masses from the data set. Depending on data set size and RAM availability mzall is processed either in segments or in its entirety. Second, peak clusters in the m/z dimension are identified using a one-dimensional directional graph. If the distance between an m/z value, m_i, and the next, m_{i+1}, is smaller than dc, an edge connecting the two is added to the graph. The connected components in the resulting graph represent the m/z clusters. We let dc increase with m/z to account for the peak broadening described by the known theoretical relationship between peak width (at half-maximum) and m/z: dc = f(m/z) where f depends on instrument type.12
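The clustering step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' MATLAB code. Because the graph is one-dimensional and only links consecutive sorted values, its connected components are simply runs of values whose gaps stay below dc. The width model `orbitrap_fwhm` is an assumption: it uses the common Orbitrap approximation that resolving power falls off as 1/sqrt(m/z), anchored at 60 000 resolution at m/z 400. The function names and the `frac` and `min_size` parameters are hypothetical; `min_size` stands in for the minimum cluster size mentioned in the text.

```python
import math

def orbitrap_fwhm(mz, res_at_400=60000.0):
    """Assumed peak width (FWHM) model: resolving power ~ 1/sqrt(m/z)."""
    resolution = res_at_400 * math.sqrt(400.0 / mz)
    return mz / resolution

def cluster_mz(mz_all, frac=0.2, min_size=2):
    """Group sorted m/z values into runs whose consecutive gaps are below d_c.

    d_c is taken as frac times the assumed FWHM at the left-hand value.
    Clusters with fewer than min_size values are discarded.
    """
    clusters, current = [], []
    for m in sorted(mz_all):
        # Start a new cluster when the gap to the previous value is too large.
        if current and (m - current[-1]) >= frac * orbitrap_fwhm(current[-1]):
            if len(current) >= min_size:
                clusters.append(current)
            current = []
        current.append(m)
    if len(current) >= min_size:
        clusters.append(current)
    return clusters

# Toy input: two tight groups near m/z 394 and one isolated value.
mz_all = [394.1610, 394.1615, 394.1622, 394.1760, 394.1768, 500.0]
clusters = cluster_mz(mz_all)
```

With these toy values, d_c is roughly 0.0013 at m/z 394, so the list splits into one cluster of three values and one of two, and the isolated value at m/z 500 is dropped.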

Suits et al.13 summarized the relationship between peak width and instrument type. To reduce processing time, we discard clusters containing fewer than a minimum number of peaks. The threshold should be set sufficiently low to retain peaks representing meaningful anatomical structures in the tissue and is therefore dependent on the spatial resolution of the experiment. Finally, to test whether a cluster contains one or more peaks, a KDE is fitted to the distribution of m/z values within the cluster. The kernel bandwidth is optimized for each cluster individually using the normal optimal smoothing method described by Bowman and Azzalini.14 Peaks are detected on the KDE curve in an iterative fashion: first the local maxima are detected and added together with their corresponding heights to a cluster-specific peak list, pkde. The m/z corresponding to the highest peak in this list, mzmax, is added to the global peak list, mzref, and all surrounding peaks in pkde that fall within dkde, including mzmax, are removed. This step is repeated until pkde is empty. The parameter dkde is proportional to the expected peak width of the instrument in the same manner as dc. The ion images are then generated by aligning each centroided spectrum to the resulting reference spectrum mzref, using a nearest neighbor method with maximum drift threshold dependent on the expected theoretical peak width (at half-maximum), similarly to the threshold used when generating edges between peaks in the clustering step.
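The KDE fitting and the iterative peak selection for a single cluster can be sketched as below. This is a hedged illustration, not the authors' implementation. It assumes a Gaussian kernel and takes the normal-reference bandwidth for one-dimensional data to be sd · (4/(3n))^(1/5), which is our reading of the normal optimal smoothing rule. The grid size, the function names, and the synthetic m/z values are hypothetical.

```python
import math
import statistics

def normal_bandwidth(values):
    """Assumed normal-reference bandwidth for a 1-D Gaussian KDE."""
    n = len(values)
    return statistics.stdev(values) * (4.0 / (3.0 * n)) ** 0.2

def kde_curve(values, grid, h):
    """Gaussian kernel density estimate evaluated at each grid point."""
    norm = 1.0 / (len(values) * h * math.sqrt(2.0 * math.pi))
    return [norm * sum(math.exp(-0.5 * ((x - v) / h) ** 2) for v in values)
            for x in grid]

def pick_peaks(values, d_kde, n_grid=400):
    """Return reference m/z values found on the KDE of one cluster."""
    h = normal_bandwidth(values)
    lo, hi = min(values) - 3 * h, max(values) + 3 * h
    grid = [lo + (hi - lo) * k / (n_grid - 1) for k in range(n_grid)]
    dens = kde_curve(values, grid, h)
    # p_kde: local maxima of the KDE curve as (height, m/z) pairs.
    p_kde = [(dens[k], grid[k]) for k in range(1, n_grid - 1)
             if dens[k - 1] < dens[k] >= dens[k + 1]]
    mz_ref = []
    while p_kde:
        _, mz_max = max(p_kde)            # highest remaining maximum
        mz_ref.append(mz_max)
        # Remove mz_max and every maximum within d_kde of it.
        p_kde = [(hgt, mz) for hgt, mz in p_kde if abs(mz - mz_max) > d_kde]
    return sorted(mz_ref)

# Synthetic cluster: 20 values around 394.161 and 20 around 394.176.
cluster = ([394.1610 + 0.0001 * k for k in range(-10, 10)]
           + [394.1760 + 0.0001 * k for k in range(-10, 10)])
peaks = pick_peaks(cluster, d_kde=0.003)
```

A small `d_kde` keeps both maxima of this synthetic cluster, whereas a value larger than their separation merges them into a single reference peak.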

Peak Selection. Although our method is more directed than slicing the spectra across the m/z range (since it only considers a selection of the m/z regions), it still generates many peaks representing noise in addition to those correlated with actual tissue structures, making it essential to separate the former from the latter. We use the spatial chaos8 (SC) and the principal component analysis (PCA)-based variance explained15 (VE) measures to automatically estimate the level of structure in the ion images. The spatial chaos counts the number of connected objects in an ion image. More structured ion images are expected to have fewer disconnected (separate) objects than unstructured ones. The VE measure is the percentage of total variance explained by the first pair of singular vectors of each ion image. This corresponds to how much of the variation in intensity along one axis of the image is explained by the intensities along the other. The first principal component inherently explains the most variance and, thus, if it explains very little, so will all others. In structured images there is typically an intensity relationship between the axes and therefore their VE is expected to be higher than that of images with randomly distributed intensities, i.e., unstructured images, in which this relationship is unlikely to exist.
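The VE measure can be illustrated with a short sketch. Several choices here are assumptions rather than details taken from the paper: the image is used without mean-centering, VE is computed as the squared leading singular value divided by the sum of all squared pixel intensities, and power iteration is swapped in for a full singular value decomposition so that the example needs only the standard library. The uniform start vector is adequate for nonnegative intensity images. The function name and the toy images are hypothetical.

```python
import math

def variance_explained(image, n_iter=100):
    """Share of the total sum of squares captured by the leading singular pair.

    image is a list of rows of nonnegative intensities. The leading right
    singular vector is approximated by power iteration on A^T A.
    """
    rows, cols = len(image), len(image[0])
    total = sum(x * x for row in image for x in row)  # sum of squared singular values
    if total == 0.0:
        return 0.0
    v = [1.0 / math.sqrt(cols)] * cols                # uniform unit start vector
    for _ in range(n_iter):
        u = [sum(row[j] * v[j] for j in range(cols)) for row in image]           # u = A v
        w = [sum(image[i][j] * u[i] for i in range(rows)) for j in range(cols)]  # w = A^T u
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    u = [sum(row[j] * v[j] for j in range(cols)) for row in image]
    return sum(x * x for x in u) / total              # sigma_1^2 / total

# A single localized spot is a rank-one image; a checkerboard is not.
spot = [[0, 0, 0, 0],
        [0, 5, 5, 0],
        [0, 5, 5, 0],
        [0, 0, 0, 0]]
checker = [[1, 0],
           [0, 1]]
ve_spot = variance_explained(spot)
ve_checker = variance_explained(checker)
```

The localized spot scores close to 1 while the two-pixel checkerboard scores 0.5, which is consistent with the expectation that VE rewards images whose intensity along one axis is explained by the other.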

■ RESULTS AND DISCUSSION

Two data sets were used to assess the performance of our novel MSI data preprocessing algorithm based on clusterwise peak detection. The first MALDI-MSI data set (referred to as the “spiked data set”) was generated by spiking a rat liver section with 5 mixtures of 4 ground-truth drugs (12 different compounds in total) in various concentrations. These mixtures were spotted on a rat liver tissue section at five different locations in circular areas of the same size (Figure S1) and, after matrix deposition, the whole tissue section was analyzed by MALDI-MSI using 50 μm spatial resolution. The concentrations of the drug compounds covered an intensity range of 3 orders of magnitude between trametinib (1.70 × 10^4) and ipratropium (1.49 × 10^7). Furthermore, some of the ground-truth drugs, such as erlotinib and dasatinib, were spotted at multiple locations in different concentrations. The second data set, originally from Römpp et al.,10 comes from a mouse bladder section and was downloaded from ProteomeXchange (PXD001283). This MSI data set was generated by an LTQ Orbitrap instrument with an ion source built in-house used to scan the mouse bladder section with 10 μm spatial resolution. The authors of this study presented the ion images of 11 compounds. These images were generated with a narrow bin width of 0.01 Da. For this data set, we use the mass of these compounds as ground truth, i.e., peaks known to be present.

Figure 2. (a) The distribution of m/z peak values within the cluster containing erlotinib (m/z 394.176). (b−e) The ion images that correspond to the four peaks on the KDE curve. (f) The ion image obtained by binning the spectra between 394.15 and 394.20 m/z; this image demonstrates how four signals can be mixed in the same ion image even when a relatively narrow m/z window is used.
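As a worked check of the spacing in this cluster, the snippet below converts the gaps between the four peak positions reported in the text (394.161, 394.166, 394.172, and 394.176 m/z) into parts per million and confirms that all four fall inside the 394.15 to 394.20 m/z window of panel f. The helper name `ppm_distance` is hypothetical, and the choice of the lower m/z as the reference mass is an assumption that does not affect the rounded values.

```python
def ppm_distance(mz_a, mz_b):
    """Distance between two m/z values in parts per million, relative to mz_a."""
    return abs(mz_b - mz_a) / mz_a * 1e6

peaks = [394.161, 394.166, 394.172, 394.176]
gaps_ppm = [round(ppm_distance(a, b)) for a, b in zip(peaks, peaks[1:])]
all_in_window = all(394.15 <= mz < 394.20 for mz in peaks)

print(gaps_ppm)       # [13, 15, 10]
print(all_in_window)  # True
```

The rounded gaps agree with the interpeak distances of 13, 15, and 10 ppm quoted for this cluster.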


Recall of Known Compounds. We applied Cardinal, MALDIquant, slicing the spectra into 0.05 Da bins, and our clusterwise peak detection method to the spiked data set to compare their ability to recall compounds. The difference between the known mass of each ground-truth drug and the mass of the closest detected peak is used as the measure of accuracy for Cardinal and our method. The ion images corresponding to the monoisotopic peak of the ground-truth drugs were manually evaluated to confirm that a compound had been correctly found. First, we ran Cardinal and detected 4751 peaks; we did not filter out those with too low pixel frequency. The corresponding ion images were generated by binning around each peak. Eight of the 12 compounds were detected with a mass deviation ranging between 4.23 and 198.85 ppm (mean 83.983 ppm). Figure S2 shows the ion images of the drug compounds generated by Cardinal. The ion images of erlotinib (394.176 Da) and gefitinib (447.160 Da) are contaminated with signal from other compounds while sunitinib (399.220 Da), imatinib (494.267 Da), and trametinib (616.086 Da) are completely missed. Second, we used MALDIquant to compute a mean spectrum on which we detected 521 peaks. Only the peak from the drug with the highest measured intensity, ipratropium, was found with a mass deviation of 4.7145 ppm. The ion image corresponding to the monoisotopic peak of ipratropium indicates that this compound has diffused from the spotting location and because of this covers a significantly larger region of the tissue than the other compounds; this might contribute to its presence in the mean spectrum, which favors signals that have high intensity and/or pixel frequency. Third, we sliced the spectra with a bin width of 0.05 Da across the 150−1000 m/z range, resulting in 17 000 slices. To assess the sensitivity of the slicing approach we manually examined the ion images corresponding to the slices containing the m/z of the spiked-in drug compounds (Figure S3). The signal from trametinib (616.086 Da) is missed and those from erlotinib (394.176 Da) and imatinib (494.267 Da) are mixed with others, resulting in contaminated ion images. Finally, when applying our method, we identified 3148 m/z clusters in the data set peak list and on the KDEs of these we detected 6088 peaks. We used a value of 0.2 times the theoretical peak width at half-maximum for dc, the parameter controlling the maximum distance between connected points that form the m/z clusters. Decreasing or increasing dc between 0.1 and 0.5 results in a higher or lower number of clusters, respectively, but ultimately has little impact on the final peak list. All of the 12 spiked-in compounds are detected with mass deviations ranging between 1.00 and 4.29 ppm (mean 2.598 ppm). Figure S4 shows the ion images corresponding to the monoisotopic peaks of the drug compounds generated by our method. The signal from trametinib is weak but detected nevertheless; it had the lowest measured intensity, which can explain its absence in some of the spectra. Generally, the quality of images generated with our approach is higher than that of the images generated with Cardinal or by slicing. The drug signals are clearly visible against the background, and there is no contamination with signals from other compounds, background, or matrix. Table S1 shows the mass deviations of the detected peaks corresponding to the spiked-in drugs obtained with Cardinal and our algorithm. The corresponding ion images are shown in Figure S2 and Figure S4, respectively.

An example of a cluster with densely located molecule signals is that containing erlotinib (394.176 Da) (Figure 2a). There are four distinctive signals within this relatively narrow m/z window (0.04 Da) at 394.161, 394.166, 394.172, and 394.176 m/z with interpeak distances of 13, 15, and 10 ppm. The peak at 394.161 m/z is tissue-derived while those at 394.166 m/z and 394.172 m/z come from a fragment molecule of imatinib and the matrix, respectively. Using our method we are able to separate the four peaks and generate a clean image for each of them. Figure 2b−e shows the ion images related to these peaks. If the spectra are binned between 394.150 and 394.200 m/z instead, the signals from three of the four compounds appear in the same ion image, i.e., they are incorrectly combined into one ion-image while that from the peak at 394.172 m/z is invisible (Figure 2f) due to its low intensity compared to the other three. We found that a value between 0.25−0.5 times the theoretical peak width at half-maximum is a good choice for dkde, the parameter controlling the minimum distance between two adjacent peaks on the KDE curve. Using a higher value results in fewer noise peaks; however, we lose true peaks, e.g., those from imatinib and erlotinib. Because of this, we recommend using a small dkde to delay filtering out noise peaks until after alignment by using one of the spatial distribution based peak selection methods. The kernel bandwidth used when generating the cluster KDEs is optimized for each cluster individually to account for the variability in peak density. This parameter determines the level of smoothing when estimating the distribution of the peak masses within the clusters. Similarly to dkde, using a higher bandwidth results in less noisy data but may lead to losing true peaks or mixing signals from multiple compounds.

We also applied our cluster-based peak detection method to the high spatial resolution mouse bladder data set. In this data set we detected 1702 m/z clusters and 6482 peaks. We then filtered out peaks which were present in fewer than 200 of the 33 000 spectra, resulting in a final list of 1024 data set peaks. The original paper reported 11 ion images that were manually generated by binning around peaks with known m/z using a very narrow bin width of 0.01 Da. All peaks corresponding to these ion images are found by our peak detection method in an unsupervised fashion, including the two densely located peaks at 770.5097 and 770.5698 m/z originating from the K+ adduct of PC(32:1) [phosphatidylcholine] and an isotope of the K+ adduct of SM(36:1) [sphingosylphosphorylcholine], respectively (Figure 3). Figure S5 shows the ion images related to the 11 detected peaks.

Figure 3. Distribution of peak m/z values within the cluster containing PC(32:1) (770.5109 m/z) and SM(18:0) (770.5609 m/z). The ion images corresponding to the two highest peaks on the KDE curve are shown in the bottom left and bottom right.

Peak Selection. As previously mentioned, we find more than 6000 peaks in the rat liver data set with our cluster-based peak detection, resulting in an equal number of ion images. Manually evaluating each image is impractically slow, but by computing the spatial chaos (SC) and the variance explained (VE) for all ion images, including those of the compounds known to be present, we can estimate how much we can reduce the number of images without losing relevant information. For each data set, we took the VE and SC scores of the ion images corresponding to the known compounds and used their mean scores minus two standard deviations as low-end thresholds. The number of peaks whose images had scores above these thresholds indicates how many of the detected peaks should be kept and how many can be rejected as noise. In the spiked data set this filtering resulted in a final list of 843 and 2170 peaks when we filtered based on VE and SC scores, respectively. The numbers of peaks obtained for the mouse bladder data set are 418 and 288 for VE and SC, respectively. The number of ion images whose VE or SC score is above various thresholds is shown in Figure 4. The number of peaks can potentially be further reduced if off-tissue regions are available; biologically irrelevant peaks, such as those coming from solvents or the matrix, can be filtered out since their signal often is stronger in these regions.15

Figure 4. Number of ion images surviving varying thresholds on the VE and SC scores in the two data sets. Dashed lines mark the lowest scores (excluding the low quality image for m/z 616.127) of the ion images corresponding to the drugs in the spiked data set (top) and known compounds in the mouse bladder data set (bottom).
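The thresholding rule described under Peak Selection (mean score of the known compounds minus two standard deviations) can be written as a short sketch. The use of the sample standard deviation, the strict "greater than" comparison, the function names, and all scores in the toy example are assumptions made for illustration; they are not values from the paper.

```python
import statistics

def low_end_threshold(known_scores, n_sd=2.0):
    """Mean of the known compounds' scores minus n_sd sample standard deviations."""
    return statistics.mean(known_scores) - n_sd * statistics.stdev(known_scores)

def keep_peaks(scores_by_mz, threshold):
    """Return the m/z values whose image score lies above the threshold."""
    return sorted(mz for mz, score in scores_by_mz.items() if score > threshold)

# Hypothetical scores: three known compounds and two candidate peaks.
known = [0.6, 0.7, 0.8]
candidates = {394.176: 0.55, 494.2505: 0.45}

threshold = low_end_threshold(known)
kept = keep_peaks(candidates, threshold)
```

With these toy numbers the threshold is 0.5, so the first candidate is kept and the second is rejected as noise.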

Despite its simplicity, the VE score proved to be very effective in ranking the quality of the ion images generated from both the spiked and mouse bladder data sets. Specifically, VE favors images which have intensities localized to small regions, e.g., all of the spiked-in compounds in the spiked data set and heme b, M+ at m/z = 616 (Figure S5c) in the mouse bladder data set. In contrast, ion images with high levels of structure across the entire scanned region tend to be rewarded with the highest SC scores, making SC suitable as a general measure of image quality but less effective than the VE score in identifying ion images with localized structured intensity patterns. The two scores appeared to be partially complementary to each other; the Pearson correlations between the VE and SC scores in the spiked and mouse bladder data sets were 0.6158 and 0.4821, respectively. Tables 1 and 2 show the VE and SC scores of the ion images corresponding to the ground truth compounds in the spiked and mouse bladder data sets, respectively.
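For reference, the Pearson correlation used to compare the two scores can be computed as in the sketch below. The function name is hypothetical, and the input lists hold only the VE and SC scores of the first four drugs in Table 1, so the result is not expected to reproduce the data set wide correlations quoted above. The same function applies to two flattened ion images when the correlation is used for colocalization analysis.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

# VE and SC scores of ipratropium, vatalanib, erlotinib, and sunitinib (Table 1).
ve_scores = [0.5997, 0.7183, 0.7837, 0.6845]
sc_scores = [0.9997, 0.9952, 0.9775, 0.9921]
r = pearson(ve_scores, sc_scores)
```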

Detection of Fragments and Isotopes. MALDI-MSI is an important tool often used to investigate the distribution of drugs and drug metabolites in tissue during pharmaceutical research, and obtaining comprehensive lists of interacting molecules is crucial during their development. To this end, we further assessed the performance of our peak detection method by searching for molecules colocalized with the drugs in the spiked data set. Colocalization analysis can be performed by computing the Pearson correlation coefficient between the ion image of a peak of interest and all other images.5,16,17 For each

Table 1. VE and SC Scores of the Ion Images Corresponding to the Spiked-in Drug Compounds in the Spiked Data Set and Their Corresponding Rank among the 4771 Ion Images That Remain after Removing Those with Fewer Than 400 Nonzero Pixels

| compound | mass | VE | percentile (VE) | rank (VE) | SC | percentile (SC) | rank (SC) |
|---|---|---|---|---|---|---|---|
| ipratropium | 332.223 | 0.5997 | 99.43 | 27 | 0.9997 | 99.94 | 3 |
| vatalanib | 347.107 | 0.7183 | 99.79 | 10 | 0.9952 | 79.29 | 988 |
| erlotinib | 394.177 | 0.7837 | 99.85 | 7 | 0.9775 | 61.04 | 1859 |
| sunitinib | 399.220 | 0.6845 | 99.73 | 13 | 0.9921 | 72.23 | 1325 |
| pazopanib | 438.171 | 0.8853 | 99.98 | 1 | 0.9837 | 64.60 | 1689 |
| gefitinib | 447.160 | 0.8362 | 99.92 | 4 | 0.9948 | 78.22 | 1039 |
| sorafenib | 465.094 | 0.8328 | 99.90 | 5 | 0.9951 | 79.04 | 1000 |
| dasatinib | 488.164 | 0.6400 | 99.62 | 18 | 0.9980 | 92.10 | 377 |
| imatinib | 494.267 | 0.7611 | 99.81 | 9 | 0.9766 | 60.64 | 1878 |
| dabrafinib | 520.109 | 0.5499 | 97.78 | 106 | 0.9964 | 83.29 | 797 |
| lapatinib | 581.143 | 0.6715 | 99.69 | 15 | 0.9775 | 60.97 | 1862 |
| trametinib | 616.086 | 0.1696 | 70.72 | 1397 | 0.9038 | 53.07 | 2239 |

Table 2. VE and SC Scores of the Ion Images Corresponding to the 11 Compounds Reported by Römpp et al.10 and Their Corresponding Rank among the 1053 Candidate Ion Images That Remain after Removing Those with Fewer Than 200 Nonzero Pixels

| compound | mass | VE | percentile (VE) | rank (VE) | SC | percentile (SC) | rank (SC) |
|---|---|---|---|---|---|---|---|
| LPC (16:0), [M + K]+ | 535.296 | 0.1770 | 92.76 | 74 | 0.9897 | 94.52 | 56 |
| LPC (18:0), [M + K]+ | 562.327 | 0.2732 | 98.14 | 19 | 0.9964 | 99.12 | 9 |
| heme b, M+ | 616.177 | 0.2385 | 96.67 | 34 | 0.9261 | 70.84 | 298 |
| unknown | 713.452 | 0.0911 | 75.93 | 246 | 0.9444 | 73.68 | 269 |
| SM (16:0) | 742.531 | 0.2140 | 95.50 | 46 | 0.9953 | 98.24 | 18 |
| unknown | 743.548 | 0.1921 | 94.42 | 57 | 0.9691 | 84.34 | 160 |
| PC (32:1), [M + K] | 770.507 | 0.2688 | 97.95 | 21 | 0.9814 | 88.85 | 114 |
| SM (18:0), [M + K] | 770.565 | 0.1439 | 87.87 | 124 | 0.9849 | 90.90 | 93 |
| PC (32:0), [M + K]+ | 772.525 | 0.3177 | 98.83 | 12 | 0.9975 | 99.80 | 2 |
| PC (34:1), [M + K]+ | 798.541 | 0.3383 | 99.02 | 10 | 0.9979 | 99.90 | 1 |
| PE (38:1) | 812.557 | 0.1623 | 91.39 | 88 | 0.9909 | 95.21 | 49 |


drug compound, we computed the correlation coefficient between the ion image corresponding to its monoisotopic peak and every ion image in each of two full image sets: the one generated from the peaks found with our clusterwise peak detection method and the one generated by slicing, in both cases without performing peak filtering based on spatial distribution. We manually assessed images whose correlation coefficient was ≥0.5 to search for candidate fragments and isotopes with spatial intensity distributions matching those of the drugs. The m/z of the matching images and existing knowledge about the theoretical fragmentation pattern of the drugs were then used to identify the fragments. This resulted in the identification of 46 isotopes and fragments in the ion image set generated by our method and 32 in the set generated by slicing. We thus gain an additional 14 fragments and isotopes when using our peak detection approach compared to slicing the spectra with a bin width of 0.05 Da.

The correlation analysis result of dasatinib is shown in Figure 5. In total, 12 ion images have a correlation coefficient ≥0.5. The nine most correlated images (≥0.75) consist of three isotopes of dasatinib with an m/z of 489.165, 490.159, and 491.162, and six fragments with an m/z of 319.133, 387.078, 401.094, 402.097, 403.091, and 427.110. The fragments’ and isotopes’ ion images show minimal signal mixing with other compounds as shown in Figure 5. The remaining three consist of another fragment of dasatinib with an m/z of 429.106 and a correlation coefficient of 0.5422 and two ion images related to sorafenib. The identified fragments and results of the correlation analysis are presented in the Supporting Information spreadsheet and Figures S6−S16. We also assessed the most anticorrelated images to investigate whether there was evidence of ion suppression from any of the ground-truth drugs. However, no images uniquely anticorrelated to any one of the spiking spots were found. Instead, these images were anticorrelated to all spiking spots simultaneously, indicating that they are the result of washing or ion suppression from the solvent used in the drug mixtures.
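The colocalization search described above can be sketched as follows: the Pearson correlation between one query ion image and every other image is computed, and images at or above a cutoff (0.5 in the text) are returned for manual assessment. This is an illustrative sketch only; the function name `colocalized_images`, the flattened `(n_images, n_pixels)` array layout, and the toy images are assumptions, not part of the published method.

```python
import numpy as np

def colocalized_images(images, query_index, min_corr=0.5):
    """Rank ion images by Pearson correlation with the query image.

    images: array of shape (n_images, n_pixels), one flattened ion image per row.
    Returns (index, correlation) pairs with correlation >= min_corr,
    sorted from most to least correlated, excluding the query itself.
    """
    X = np.asarray(images, dtype=float)
    Xc = X - X.mean(axis=1, keepdims=True)       # center each image
    norms = np.linalg.norm(Xc, axis=1)
    # Constant images have zero norm; dividing by 1 leaves their correlation at 0.
    safe = np.where(norms > 0, norms, 1.0)
    corr = (Xc @ Xc[query_index]) / (safe * safe[query_index])
    order = np.argsort(-corr)
    return [(int(i), float(corr[i])) for i in order
            if i != query_index and corr[i] >= min_corr]

# Toy example (hypothetical data): a localized spot on a flattened 20 x 20 image.
rng = np.random.default_rng(1)
spot = np.zeros(400)
spot[150:170] = 1.0
images = np.vstack([
    spot * 100 + rng.normal(0, 1, 400),  # 0: intense query peak
    spot * 10 + rng.normal(0, 1, 400),   # 1: weaker peak with the same distribution
    rng.normal(0, 1, 400),               # 2: unrelated noise image
])
hits = colocalized_images(images, query_index=0)
print(hits)
```

Sorting `corr` in ascending order instead would surface the most anticorrelated images, which is how candidates for ion suppression could be inspected.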

■ CONCLUSIONS

In this paper we have presented an efficient peak picking approach combining a novel peak detection algorithm with filtering based on spatial information to automatically identify ion images corresponding to isotopic peaks of both endogenous

Figure 5. Top: The ion images of the 12 most correlated peaks to dasatinib’s monoisotopic peak. Panels a−i and l are isotopes or fragments of dasatinib while panels j and k are related to sorafenib. Bottom: Sorted Pearson correlation between all ion images and that of the monoisotopic peak of dasatinib.


and drug compounds in high-resolution MSI data sets. It should be noted that these data sets were generated using high-resolution Orbitrap MSI, which is low-pass-filtered during acquisition by default. Applying our method to noisier data such as that generated by QTOF MSI would require additional preprocessing such as baseline removal and smoothing. Our KDE clusterwise peak detection algorithm enables us to find low intensity and localized peaks with minimal contamination from other peaks close in m/z, resulting in high ion image quality. We believe that implementing our MSI preprocessing algorithm in an interactive tool would be valuable to experimentalists who aim to identify a priori unknown endogenous compounds, reveal drug distributions in tissue, or find compounds that spatially correlate to known ones. Such a tool could help users gain deeper insight into the effect of drugs in tissue and considerably reduce the number of ion images that have to be examined manually.

■ ASSOCIATED CONTENT
Supporting Information
The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acs.analchem.9b02637.

Methods and figures (PDF)
Tables of correlating peaks for each spiked-in compound with structures and annotations (isotopes, fragments) and the description of the 5 drug mixtures (XLSX)
Structures of spiked-in drugs (PDF)

■ AUTHOR INFORMATION
Corresponding Author
*E-mail: [email protected]
ORCID
Melinda Rezeli: 0000-0003-4373-5616
Peter Horvatovich: 0000-0003-2218-1140
Notes
The authors declare no competing financial interest.

■ ACKNOWLEDGMENTS
We thank Frank Suits for his support and insightful discussions throughout the project and we kindly acknowledge the support from Fru Berta Kamprads Stiftelse.

■ REFERENCES
(1) Jones, E. A.; Deininger, S.-O.; Hogendoorn, P. C.; Deelder, A. M.; McDonnell, L. A. J. Proteomics 2012, 75, 4962−4989.
(2) Gessel, M. M.; Norris, J. L.; Caprioli, R. M. J. Proteomics 2014, 107, 71−82.
(3) Bemis, K. D.; Harry, A.; Eberlin, L. S.; Ferreira, C.; van de Ven, S. M.; Mallick, P.; Stolowitz, M.; Vitek, O. Bioinformatics 2015, 31, 2418−2420.
(4) Gibb, S.; Strimmer, K. Bioinformatics 2012, 28, 2270−2271.
(5) Suits, F.; Fehniger, T. E.; Vegvari, A.; Marko-Varga, G.; Horvatovich, P. Anal. Chem. 2013, 85, 4398−4404.
(6) Alexandrov, T.; Bartels, A. Bioinformatics 2013, 29, 2335−2342.
(7) Wijetunge, C. D.; Saeed, I.; Boughton, B. A.; Spraggins, J. M.; Caprioli, R. M.; Bacic, A.; Roessner, U.; Halgamuge, S. K. Bioinformatics 2015, 31, 3198−3206.
(8) Palmer, A.; Phapale, P.; Chernyavsky, I.; Lavigne, R.; Fay, D.; Tarasov, A.; Kovalev, V.; Fuchser, J.; Nikolenko, S.; Pineau, C.; Becker, M.; Alexandrov, T. Nat. Methods 2017, 14, 57.
(9) Inglese, P.; Correia, G.; Takats, Z.; Nicholson, J. K.; Glen, R. C. Bioinformatics 2019, 35, 178−180.
(10) Römpp, A.; Guenther, S.; Schober, Y.; Schulz, O.; Takats, Z.; Kummer, W.; Spengler, B. Angew. Chem., Int. Ed. 2010, 49, 3834−3838.
(11) Schramm, T.; Hester, A.; Klinkert, I.; Both, J.-P.; Heeren, R. M.; Brunelle, A.; Laprevote, O.; Desbenoit, N.; Robbe, M.-F.; Stoeckli, M.; Spengler, B.; Römpp, A. J. Proteomics 2012, 75, 5106−5110.
(12) Hoffman, E. D.; Stroobant, V. West Sussex; John Wiley & Sons, Bruxellas, Belgica, 2007, 1, 85.
(13) Suits, F.; Hoekman, B.; Rosenling, T.; Bischoff, R.; Horvatovich, P. Anal. Chem. 2011, 83, 7786−7794.
(14) Bowman, A. W.; Azzalini, A. Applied smoothing techniques for data analysis: the kernel approach with S-Plus illustrations; OUP Oxford, 1997; Vol. 18.
(15) Fonville, J. M.; Carter, C.; Cloarec, O.; Nicholson, J. K.; Lindon, J. C.; Bunch, J.; Holmes, E. Anal. Chem. 2012, 84, 1310−1319.
(16) Nemes, P.; Woods, A. S.; Vertes, A. Anal. Chem. 2010, 82, 982−988.
(17) Fehniger, T. E.; Suits, F.; Vegvari, A.; Horvatovich, P.; Foster, M.; Marko-Varga, G. Proteomics 2014, 14, 862−871.
