
Contents lists available at ScienceDirect

Analytica Chimica Acta

journal homepage: www.elsevier.com/locate/aca

Analytica Chimica Acta xxx (2017) 1–9

Exploring hyperspectral imaging data sets with topological data analysis

Ludovic Duponchel

LASIR CNRS UMR 8516, Université Lille 1, Sciences et Technologies, 59655 Villeneuve d’Ascq Cedex, France

Highlights

• First use of topological data analysis in spectroscopic imaging.
• Detection of minor compounds in a multiphase chemical system.
• TDA: a new paradigm for cluster analysis in the framework of imaging.

E-mail address: [email protected].

https://doi.org/10.1016/j.aca.2017.11.029
0003-2670/© 2017 Elsevier B.V. All rights reserved.

Please cite this article in press as: L. Duponchel, Exploring hyperspectral imaging data sets with topological data analysis, Analytica Chimica Acta (2017), https://doi.org/10.1016/j.aca.2017.11.029

Article info

Article history:
Received 2 October 2017
Received in revised form 16 November 2017
Accepted 17 November 2017
Available online xxx

Keywords:
Hyperspectral imaging
Spectroscopic imaging
Topological data analysis
Clustering
Raman spectroscopy

Abstract

Analytical chemistry is rapidly changing. We acquire ever more data in order to go ever further in the exploration of complex samples, and hyperspectral imaging has not escaped this trend. It quickly became a tool of choice for the molecular characterisation of complex samples in many scientific domains, mainly because it simultaneously provides spectral and spatial information. As a result, chemometrics provided, at an early stage, many exploration tools (PCA, clustering, MCR-ALS …) well suited to such data structures. However, we now face a new challenge given the ever-increasing number of pixels in the data cubes we have to manage. The idea is therefore to introduce a new paradigm, Topological Data Analysis, in order to explore hyperspectral imaging data sets, highlighting its useful properties and specific features. With this paper, we also point out that conventional chemometric methods are often based on variance analysis or simply impose a data model which implicitly defines the geometry of the data set. We will show that this is not always appropriate for the exploration of hyperspectral imaging data sets.

© 2017 Elsevier B.V. All rights reserved.

1. Introduction

Hyperspectral imaging is today a tool of choice for characterising complex samples in different scientific domains. Instrumental developments first contributed to this success, but multivariate data analysis tools have really revealed its potential. Many chemometric algorithms have been developed in order to explore hyperspectral data cubes without a priori knowledge, such as Principal Component Analysis (PCA), clustering techniques and Multivariate Curve Resolution-Alternating Least Squares (MCR-ALS), but we must admit that we are increasingly hampered by the growing number of pixels (i.e. spectra) in our data sets. Although these tools allow us to extract valuable information about major compounds, it is still difficult to extract information about minor ones, all the more so because of the poor signal-to-noise ratio we often observe in molecular spectroscopy. It is in this sense that we need to explore new paradigms such as topological data analysis.

Mathematicians usually use topology to study the shape of abstract objects. Not long ago, however, they discovered that it could be used for the exploration of real-world data sets, and topological data analysis (TDA) was born [1–4]. Since then, many papers have demonstrated the great potential of the concept. The first domain to take advantage of the method is certainly biology at large. We thus find many papers in genomics [5] with, for example, the development of a new visualisation tool for expression Quantitative Trait Locus (eQTL) [6], analysis of immune response [7,8], viral evolution [9], antibiotic resistance [10], bacteria [11], protein interaction networks [12,13], intestinal development in preterm and term infants [14], DNA repair [15], evolutionary processes [16,17] and cellular development [18]. TDA is also used for the study of different diseases such as pulmonary embolism [19], diabetes [20,21], autism [22], asthma [23–25], infections [26,27], cardiac septal defects and cardiomyopathy [28], ovarian cancer [29], precision oncology [30], knee osteoarthritis [31], oral squamous cell carcinomas [32] and chronic fatigue syndrome [33]. It is also used in neuroscience for the study of activity in the visual cortex [34], the analysis of traumatic brain injury [35,36] and the identification of neuroimaging biomarkers for patients with serious mental illness [37].

The first paper in analytical chemistry is dedicated to the analysis of more than 20,000 samples in order to reveal pedogenetic principles of the European topsoil system [38]. This article demonstrates the good scalability of the approach, which is able to explore and summarise quite big data sets. Another very interesting paper focuses on finding the best nanoporous materials for gas storage [39]. It was not until 2016 that a first TDA paper was published in physical chemistry and, more specifically, in molecular spectroscopy [40]. This article is dedicated to the Raman analysis of single bacteria. It demonstrates that TDA is able to classify spectra of different bacterial strains under different experimental conditions where classical clustering methods fail. TDA thus seems to have useful properties for the analysis of spectroscopic data sets. With this in mind, an article dedicated to the development of TDA for the analysis of hyperspectral data cubes is a natural extension. The first part of the paper introduces the TDA concept in the framework of hyperspectral imaging and provides details about the data set used in this study. The second part gives details about TDA network construction and compares clustering results from TDA with those of the K-means algorithm.

2. Materials and methods

2.1. Image data set

The hyperspectral data set used in this study originates from the paper of Andrew et al. [41], in which an oil-in-water emulsion system is explored with Raman spectroscopy. We selected it because of the complexity of this multicomponent/multiphase system. Moreover, it has already been explored in different chemometric papers, which is very useful for comparison purposes [42,43]. The hyperspectral data set contains 3600 preprocessed spectra (i.e. a data cube of 60 pixels by 60 pixels) with a spectral range from 950 cm⁻¹ to 1800 cm⁻¹ and thus 253 spectral variables. Spectral preprocessing was applied prior to data analysis in order to suppress fluorescence effects.

2.2. Topological data analysis of a hyperspectral data cube

The main aim of TDA is first to generate a topological network which represents the intrinsic shape of the explored data set. Fig. 1 presents its use in the framework of the analysis of a hyperspectral data cube. Because TDA, like most chemometric methods, is not suited to the direct analysis of a 3D hyperspectral data cube, an unfolding procedure is used to generate a more convenient 2D matrix of size (60 × 60, 253), i.e. 3600 spectra by 253 variables, for the selected Raman data set. The TDA is then decomposed into 8 steps: (1) Given the unfolded data set, we observe each row (i.e. spectrum) through a lens. In fact, any function that produces a number from a spectrum can be a lens. It may originate from different domains such as statistics (min, max, mean, variance, density …), geometry (centrality, curvature …) and chemometrics (PCA scores, Support Vector Machine (SVM) distance from hyperplane …), without being exhaustive. At the end of this step, we have a lens value per spectrum of the data set and, by extension, a lens value scale. (2) We then divide the lens scale into overlapping subsets. We will observe the impact of the number of subsets and the percentage of overlap in the ‘Results and discussion’ section of the paper. (3) Next we use the scale subsets to partition the data set. Because of the overlaps, the same spectrum can be found in several pixel subsets simultaneously. (4) In this step we consider each pixel subset separately and apply a cluster analysis to each. In general the single linkage algorithm [44] is used but other techniques can be implemented. The construction of the topological network starts at this point: each cluster in each analysis is represented by a node. (5) We connect nodes with edges when the corresponding clusters have at least one spectrum (i.e. pixel) in common. (6) Nodes are coloured depending on the number of spectra they contain. (7) The network is split into different subparts, or groups of pixels, considering variations of node density and/or particular features of the network shape. We generate here different classes of pixels. (8) In the last step, a clustering map is generated from the coordinates of each pixel in the sample plane and its class membership.
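Steps 1 to 5 of the procedure can be illustrated with a minimal, dependency-free Python sketch. It is not the Ayasdi implementation: the function names, the choice of the spectrum mean as lens, and the decision to obtain single-linkage clusters by cutting at a fixed distance `threshold` are all assumptions made for illustration, and the toy cube is invented.

```python
from itertools import combinations
from math import dist
from statistics import mean


def unfold(cube):
    """Unfold a rows x cols x channels cube into a list of spectra (row-major)."""
    return [spectrum for row in cube for spectrum in row]


def overlapping_intervals(lo, hi, n_subsets, overlap):
    """Step 2: cover [lo, hi] with equal intervals sharing a fraction `overlap`."""
    length = (hi - lo) / (n_subsets - (n_subsets - 1) * overlap)
    step = length * (1 - overlap)
    return [(lo + i * step, lo + i * step + length) for i in range(n_subsets)]


def single_linkage(indices, spectra, threshold):
    """Step 4: single-linkage clusters cut at `threshold`, i.e. connected
    components of the graph linking spectra closer than `threshold`."""
    clusters, unvisited = [], set(indices)
    while unvisited:
        seed = min(unvisited)
        unvisited.remove(seed)
        cluster, frontier = {seed}, [seed]
        while frontier:
            i = frontier.pop()
            near = {j for j in unvisited
                    if dist(spectra[i], spectra[j]) <= threshold}
            unvisited -= near
            cluster |= near
            frontier.extend(near)
        clusters.append(frozenset(cluster))
    return clusters


def mapper(spectra, lens, n_subsets, overlap, threshold, eps=1e-9):
    """Return (nodes, edges); each node is a frozenset of spectrum indices."""
    values = [lens(s) for s in spectra]                       # step 1
    cover = overlapping_intervals(min(values), max(values),
                                  n_subsets, overlap)         # step 2
    nodes = []
    for a, b in cover:
        subset = [i for i, v in enumerate(values)
                  if a - eps <= v <= b + eps]                 # step 3
        nodes.extend(single_linkage(subset, spectra, threshold))  # step 4
    edges = [(p, q) for p, q in combinations(range(len(nodes)), 2)
             if nodes[p] & nodes[q]]                          # step 5
    return nodes, edges


# Invented 2 x 3 pixel cube with 2 spectral variables per pixel.
cube = [[[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]],
        [[3.0, 3.0], [4.0, 4.0], [5.0, 5.0]]]
nodes, edges = mapper(unfold(cube), lens=mean,
                      n_subsets=2, overlap=0.5, threshold=1.5)
print(nodes, edges)
```

In this toy case the two overlapping lens subsets each yield one cluster, and the two nodes are linked because they share pixels 2 and 3. Node colouring (step 6) would simply use `len(node)`.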

In this article, Topological Data Analysis was performed with the Ayasdi software platform (ayasdi.com, Ayasdi Inc., Menlo Park CA). An in-house Python script was developed with the Ayasdi Python SDK in order to generate the clustering map.

2.3. Kmeans clustering

Many clustering methods have been used for the exploration of hyperspectral data cubes. However, it would not make sense to compare TDA results with every possible approach. We have therefore decided to select K-means [45,46] (KM) clustering because it is one of the most popular unsupervised classification methods. Very briefly, the goal of KM is to separate a set of n unlabelled data points (defined in a d-dimensional space) into k clusters. Each cluster is represented by its barycentre, called the centroid. As a first step, the algorithm selects at random k initial points as centroids in the d-dimensional space. An iterative process in two steps follows: (i) given a chosen distance, each point of the data set is assigned to the nearest centroid; (ii) given this new partition, the centroid of each cluster is updated. These two steps are repeated until convergence, that is to say when no more changes are observed in the assignment of the data points. In this study, the point-to-centroid distance is the Euclidean one. Because partitions from the KM algorithm are known to be very sensitive to the initialisation step, partitioning was replicated 100 times, each time with new initial centroid positions. The partition with the lowest within-cluster sum of point-to-centroid distances over all the data points is considered the optimal one. All KM calculations in this paper were performed in the MATLAB environment version R2016a (The MathWorks Inc., Natick, MA, USA) with the Statistics and Machine Learning Toolbox version 10.2 (The MathWorks Inc., Natick, MA, USA).

Fig. 1. TDA in the framework of the analysis of a hyperspectral data cube.
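The K-means procedure with replicated initialisations can be sketched in plain Python as follows. This is an illustration, not the MATLAB code used in the study. Two points are assumptions: the initial centroids are drawn among the data points, and the cost is the sum of squared Euclidean point-to-centroid distances (the usual K-means objective), whereas the text only mentions a "sum of point-to-centroid distances". The function names and toy points are invented.

```python
import random
from math import dist


def kmeans_once(points, k, rng, max_iter=100):
    """One K-means run started from k data points drawn at random."""
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # (i) assign each point to its nearest centroid (Euclidean distance)
        new_labels = [min(range(k), key=lambda c: dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:      # convergence: no assignment changed
            break
        labels = new_labels
        # (ii) move each centroid to the barycentre of its cluster
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:               # an empty cluster keeps its old centroid
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    # Assumed objective: sum of squared point-to-centroid distances.
    cost = sum(dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))
    return labels, centroids, cost


def kmeans(points, k, n_replicates=100, seed=0):
    """Repeat K-means with new initial centroids and keep the lowest cost."""
    rng = random.Random(seed)
    runs = (kmeans_once(points, k, rng) for _ in range(n_replicates))
    return min(runs, key=lambda run: run[2])


# Invented toy data: two well-separated groups of three points.
points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
          [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
labels, centroids, cost = kmeans(points, k=2)
print(labels, cost)
```

Keeping the best of many replicates reduces, but does not remove, the dependence on initialisation; the label numbering itself remains arbitrary from one run to the next.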

3. Results and discussion

In an earlier part of this paper, we have seen that network construction is a prerequisite for the final clustering of the hyperspectral data cube. Because most published articles dedicated to TDA present a particular topological network without information about its construction, we believe that it is important here to highlight the sensitive points of the concept. It is the best way to propose an objective assessment and discuss pros and cons. The first key point of TDA is the choice of the lens used for the first data set partitioning (step 1 in Fig. 1) and of the metric used for cluster analysis (step 4 in Fig. 1). Fig. 2 shows topological networks obtained from the Raman data set using different combinations of metrics and lenses. Two different lenses have been used, i.e. PCA scores along PC1 and PC2 (Fig. 2A and C) and another one called ‘Neighborhood’ (Fig. 2B and D). The ‘Neighborhood’ lens generates an embedding of high-dimensional data into two dimensions by embedding a k-nearest neighbors graph of the data. A k-nearest neighbors graph is generated by connecting each point to its nearest neighbors. This graph is embedded in two dimensions using Ayasdi's proprietary graph layout algorithm. We have selected these two lenses because they are certainly the most widely used in TDA papers and appear to be well suited to most data structures. Two different metrics have also been used, i.e. the well-known Euclidean distance (Fig. 2A and B) and another one called ‘Norm Angle’ (Fig. 2C and D). The ‘Norm Angle’ distance between two points x and y is the angle between the mean-centered, variance-normalized points. It is calculated according to eq. (1):

$$
\mathrm{Norm\ Angle}(x, y) = \cos^{-1}\left( \frac{\sum_{i=1}^{n} x_i^{\mathrm{Trans}}\, y_i^{\mathrm{Trans}}}{\sqrt{\sum_{i=1}^{n} \left(x_i^{\mathrm{Trans}}\right)^2}\, \sqrt{\sum_{i=1}^{n} \left(y_i^{\mathrm{Trans}}\right)^2}} \right) \quad \text{and} \quad x_i^{\mathrm{Trans}} = \frac{x_i - \mathrm{mean}_i}{\mathrm{StandardDeviation}_i} \tag{1}
$$

with x_i the Raman scattering at spectral variable i of the spectrum x and n the total number of variables.

Fig. 2. Topological networks of the Raman data set considering different combinations of metrics and lenses: A) lens: PCA scores, metric: Euclidean distance; B) lens: Neighborhood, metric: Euclidean distance; C) lens: PCA scores, metric: Norm Angle; D) lens: Neighborhood, metric: Norm Angle.

Returning to Fig. 2, nodes in these topological networks represent groups of pixels with similar spectra. We must also remember that nodes are linked when they have at least one pixel in common. Moreover, dense nodes containing a high number of spectra are represented by hot colours, while cold colours indicate a low number. Bearing in mind that the topological network has to be split in order to generate the final clustering map, the network to be selected is the one that makes this task easier. It is also certainly the best way to generate a less biased partition. Considering the networks using PCA scores as lens (Fig. 2A and C), we observe quite compact and continuous networks with only two or three zones of dense nodes. Real groupings can hardly be distinguished. This result can be explained by the fact that no real clusters are observed in the PC1/PC2 scores scatterplot in Fig. 5B, mainly due to the high density of points. Moreover, the choice of the metric does not change much in these two cases. The situation is significantly different when the ‘Neighborhood’ lens is used (Fig. 2B and D). Different clusters of varying density are now observed and the shape of the networks shows new features such as flares or loops. In the perspective of the final splitting, the best topological network is certainly the one using the ‘Neighborhood’ lens and the ‘Norm Angle’ metric (Fig. 2D). This time the metric has a real influence on the groupings. However, considering the four networks, the key point seems to be the selection of the lens. In a nutshell, the construction of the topological network is a delicate task because we need to find a good lens and metric combination.
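The ‘Norm Angle’ metric of eq. (1) can be sketched in plain Python. The sketch reads mean_i and StandardDeviation_i as the mean and standard deviation of spectral variable i over the whole data set; that reading, the use of the population standard deviation, and the function names are assumptions, and the toy spectra are invented. A variable with zero variance or an all-zero transformed spectrum would cause a division by zero and is not handled here.

```python
from math import acos, sqrt
from statistics import mean, pstdev


def standardise(spectra):
    """Mean-centre and variance-normalise each spectral variable (column)."""
    columns = list(zip(*spectra))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) for col in columns]   # assumed: population std
    return [[(x - m) / s for x, m, s in zip(row, means, stds)]
            for row in spectra]


def angle(u, v):
    """Angle in radians between two vectors (arccos of cosine similarity)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] to absorb floating-point round-off before acos.
    return acos(max(-1.0, min(1.0, dot / norm)))


# Invented toy data: three spectra with three spectral variables each.
spectra = [[1.0, 2.0, 4.0],
           [2.0, 4.0, 8.1],
           [4.0, 1.0, 2.0]]
transformed = standardise(spectra)
d01 = angle(transformed[0], transformed[1])
d10 = angle(transformed[1], transformed[0])
print(d01)
```

Because the angle depends only on the direction of the transformed spectra, two spectra that differ by a global intensity factor after standardisation are at distance zero, which is not the case with the Euclidean distance.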

Given this combination, two tuning parameters also have to be chosen to finalise the network construction. The first one is called resolution. It is simply the number of overlapping subsets generated from the lens scale (Fig. 1, step 2). Fig. 3 shows different networks using the ‘Neighborhood’ lens, the ‘Norm Angle’ metric and a resolution varying from 10 to 40. By playing with the resolution, it is possible to zoom in and out on the data set, which is an interesting and quite uncommon property in multivariate data analysis. At low resolution, we observe a rather coarse view of the data set, which could be a good way to summarise it. When the resolution increases, we generate more detailed networks. It is also possible to observe more subparts of networks with contrast in node density and new features such as flares and loops. However, if the resolution is increased too much, as with a value of 40 for this data set, the network becomes fragmented, with the result that links and similarity information between subgroups are lost (black arrow in Fig. 3). For this specific task of TDA for hyperspectral imaging, the author thinks that an intermediate resolution has to be selected to ensure the observation of the highest number of groups while keeping the links between them. Thus a resolution of 30 seems to be a good compromise.

Fig. 3. Influence of resolution on the topological network.

The second tuning parameter, called gain, controls the overlap between subsets (eq. (2)):

$$
\text{Percentage of overlap} = 100 \times \left( \frac{\mathrm{gain} - 1}{\mathrm{gain}} \right) \tag{2}
$$

Fig. 4 shows different networks using the ‘Neighborhood’ lens, the ‘Norm Angle’ metric, a resolution of 30 and a gain varying from 2 to 4.5 (i.e. an overlap between 50% and 78%). In this case, fragmented networks are observed for low gain and more compact ones, with more edges between nodes, for higher values. With a gain of 3, a compromise is chosen for the same reasons as those explained above. Once more, we need to find a good combination of the two tuning parameters in order to generate the topological network. As a result, the topological network used for the rest of the paper has been generated using the ‘Neighborhood’ lens, the ‘Norm Angle’ metric, a resolution of 30 and a gain of 3.

Fig. 4. Influence of gain on the topological network.
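The relation between gain and overlap in eq. (2) is easy to check numerically. The short Python sketch below also shows one way to turn a resolution and a gain into overlapping intervals on a one-dimensional lens scale; that construction (equal-length intervals, each shifted by length/gain) is an assumption for illustration and is not taken from the Ayasdi platform. Function names are hypothetical.

```python
def overlap_percentage(gain):
    """Eq. (2): percentage of overlap between adjacent subsets."""
    return 100.0 * (gain - 1.0) / gain


def lens_cover(lo, hi, resolution, gain):
    """Cover [lo, hi] with `resolution` equal-length intervals such that
    consecutive intervals share overlap_percentage(gain) % of their length."""
    length = (hi - lo) / (1.0 + (resolution - 1) / gain)
    step = length / gain          # shift between consecutive intervals
    return [(lo + i * step, lo + i * step + length)
            for i in range(resolution)]


for gain in (2, 3, 4.5):
    print(gain, round(overlap_percentage(gain), 1))

# Settings retained in the text: resolution of 30 and gain of 3,
# applied here to a lens scale normalised to [0, 1].
cover = lens_cover(0.0, 1.0, resolution=30, gain=3)
print(cover[0], cover[1])
```

With this construction, a higher resolution shortens every interval (zooming in), while a higher gain increases the number of spectra shared by neighbouring subsets and therefore the number of edges in the network.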

Since the main aim of this work is to determine whether TDA is well suited to the exploration of hyperspectral data cubes, or at least provides another point of view, it was deemed important to compare its results with those obtained from conventional chemometric methods like PCA and KM clustering. Fig. 5 shows a principal component analysis of the Raman data set with scores maps (part A) and scatterplots (part B). Considering the scores maps and the corresponding principal components (not shown here), the first four seem to be significant, as already noted in previous works [41–43]. Indeed this oil-in-water emulsion system has been described as two droplet phases, a structural phase and an aqueous one. It is also interesting to look at the scores scatterplots because they highlight an issue commonly observed in the analysis of hyperspectral data sets. As we can see, the number of pixels and the density of points are so high that we have a rather continuous cloud of points whatever the PCA hyperplane, and almost no clusters are really separated from each other.

Fig. 5. Principal component analysis of the Raman data set. (A) Scores maps. (B) Scores scatterplots.

Because PCA is not, strictly speaking, a clustering method, we show in Fig. 6 the results of the KM algorithm for a number of clusters varying from 4 to 8. An unbiased view of the sample is obviously only obtained for an optimal number of clusters but, from the author's perspective, there is no perfect criterion to validate the quality of a partition, although many have been proposed in the literature [47–57]. Therefore, the idea here is to look at all the proposed KM partitions and compare them with other methods. Generally speaking, KM clustering seems able to describe the structure of the different phases. However, a closer look reveals that some details are left out. One can retrieve, for example, the highest contributions of the PC3 scores map (black arrows) in the KM maps using 6 or 7 clusters. However, the lower contributions in the scores map (white arrows) are not retrieved with KM whatever the number of clusters (even above 8, not shown in the paper). Indeed these contributions are very close to the image noise level. Besides, all these contributions of the PC3 scores map were retrieved in an MCR component in a previous work [42].

Fig. 6. K-means clustering of the Raman data set considering different numbers of clusters.

It is therefore now interesting to examine the behaviour of TDA for the same clustering task. Fig. 7 shows the splitting of the TDA network into different subparts considering the variation of node density and geometric features. Nodes of the network are coloured in black in order to highlight the different groups of pixels we generate. Admittedly, at this time the splitting is done manually, without a criterion that would allow us to automate the partition task. However, the clustering map generated from this network is not a meaningless result. From a general point of view, the oil phase is defined by 4 classes (groups 2, 3, 6 and 7) and the aqueous medium by 3 others (groups 1, 4 and 5). We can also observe that the two phases are well separated in the topological network. Because overlapping subsets of lens values are used for the network construction, the same pixels can be found in different nodes. It then becomes impossible to assign such pixels to a particular class. As a consequence, pixels without a unique class membership are coloured in black in the clustering map. We could additionally generate new classes for these particular pixels with singular behaviours, but we have decided to simplify the message for this introductory paper. Nevertheless, this feature should be considered in the future because it brings interesting information about pixels. More specifically, pixel group 3 reproduces perfectly what has been observed with PCA and MCR. We also note, with group 6, a micron-size layer at the oil/water interface which is observed with neither PCA nor MCR. However, with this paper we do not claim that TDA is a better method (always remembering the “no free lunch” theorem [58–60]), but we are convinced it could give another point of view for the exploration of hyperspectral data sets.

Fig. 7. TDA clustering of the Raman data set.
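The rule used for the clustering map, where a pixel keeps its class only when its membership is unique, can be sketched as follows. This is not the in-house script mentioned in Section 2.2: the function name, the use of label 0 for "black" pixels, the row-major pixel indexing and the toy groups are all assumptions for illustration.

```python
def clustering_map(groups, n_rows, n_cols):
    """Build a class map from network groups.

    groups: dict mapping a class label (int >= 1) to the unfolded pixel
    indices contained in the nodes of that group. Pixels belonging to
    several groups, or to none, keep the label 0 (shown in black).
    """
    membership = {}
    for label, pixels in groups.items():
        for p in pixels:
            membership.setdefault(p, set()).add(label)
    image = [[0] * n_cols for _ in range(n_rows)]
    for p, labels in membership.items():
        if len(labels) == 1:
            row, col = divmod(p, n_cols)   # inverse of row-major unfolding
            image[row][col] = next(iter(labels))
    return image


# Invented toy example on a 2 x 3 image: pixel 2 is shared by both groups
# and pixel 5 belongs to none, so both stay at 0.
groups = {1: {0, 1, 2}, 2: {2, 3, 4}}
image = clustering_map(groups, n_rows=2, n_cols=3)
print(image)
```

Keeping the full `membership` dictionary, rather than only the final map, would preserve the information carried by shared pixels that the text suggests exploiting in future work.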

4. Conclusion

The aim of the paper was to introduce the concept of topological data analysis in order to explore hyperspectral imaging data sets. We were not so much interested in a precise and quantitative evaluation of the concept compared with usual methods as in a more qualitative one highlighting its good properties. It was also an opportunity to propose an objective analysis, which is rarely done in TDA papers to date. First, we have seen the importance of topological network settings. Image analysis requires the selection of a good lens/metric combination. There is also, to a lesser extent, the search for a good compromise between resolution and gain. Admittedly, for the moment, this optimisation is done in a rather intuitive way. Our first future work will be to find criteria in order to propose optimal settings automatically, without a priori knowledge. In the specific case of TDA for imaging, we have also seen that the splitting of the topological network was necessary in order to generate the final clustering map. Once again, we will have to find a criterion to automate this task, taking into account variations of node density and geometric features of the network. Although TDA is not yet optimised for imaging, we have observed interesting results demonstrating good properties for such data analysis. We also think that it has not revealed all its potential. For example, minor compounds present in a low number of pixels of the cubes could be better detected with TDA than with PCA-based methods using variance. Pixels shared between nodes could also give original information, avoiding their assignment to a unique class when this is not justified. Even more important, TDA has already proven its ability to manage big data sets in genomic analysis. At the same time, we all know that the analysis of data cubes with millions of pixels is still very tricky and great precautions must be taken in order to generate unbiased maps. Thus TDA should provide a new point of view, for example with the zoom in/zoom out capability. We must not underestimate the big data issue in chemometrics because we are going to face it soon, if it is not already there for some of our analyses.

In conclusion, based on these findings, we are convinced that TDA could offer a lot to the chemometric community, but some effort is still needed to achieve this.

References

[1] G. Carlsson, Topology and data, Bull. Am. Math. Soc. 46 (2009) 255–308.

[2] P.Y. Lum, G. Singh, A. Lehman, T. Ishkanov, M. Vejdemo-Johansson, M. Alagappan, J. Carlsson, G. Carlsson, Extracting insights from the shape of complex data using topology, Sci. Rep. 3 (2013), https://doi.org/10.1038/srep01236.

[3] G. Carlsson, Topological pattern recognition for point cloud data, Acta Numer. 23 (2014) 289–368, https://doi.org/10.1017/S0962492914000051.

[4] G. Singh, F. Mémoli, G.E. Carlsson, Topological methods for the analysis of high dimensional data sets and 3d object recognition, SPBG (2007) 91–100.

[5] P.G. Cámara, Topological methods for genomics: present and future directions, Curr. Opin. Syst. Biol. 1 (2017) 95–101, https://doi.org/10.1016/j.coisb.2016.12.007.

[6] C.W. Bartlett, S.Y. Cheong, L. Hou, J. Paquette, P.Y. Lum, G. Jäger, F. Battke, C. Vehlow, J. Heinrich, K. Nieselt, R. Sakai, J. Aerts, W.C. Ray, An eQTL biological data visualization challenge and approaches from the visualization community, BMC Bioinforma. 13 (Suppl 8) (2012).

[7] S. Schwartz, I. Friedberg, I.V. Ivanov, L.A. Davidson, J.S. Goldsby, D.B. Dahl, D. Herman, M. Wang, S.M. Donovan, R.S. Chapkin, A metagenomic study of diet-dependent interaction between gut microbiota and host in infants reveals differences in immune response, Genome Biol. 13 (2012) r32.

[8] T. Lakshmikanth, A. Olin, Y. Chen, J. Mikes, E. Fredlund, M. Remberger, B. Omazic, P. Brodin, Mass cytometry and topological data analysis reveal immune parameters associated with complications after allogeneic stem cell transplantation, Cell Rep. 20 (2017) 2238–2250, https://doi.org/10.1016/j.celrep.2017.08.021.

[9] J.M. Chan, G. Carlsson, R. Rabadan, Topology of viral evolution, Proc. Natl. Acad. Sci. 110 (2013) 18566–18571.

[10] K.J. Emmett, R. Rabadan, Characterizing scales of genetic recombination and antibiotic resistance in pathogenic bacteria using topological data analysis, in: Int. Conf. Brain Inform. Health, Springer, 2014, pp. 540–551.

[11] A.M. Ibekwe, J. Ma, D.E. Crowley, C.-H. Yang, A.M. Johnson, T.C. Petrossian, P.Y. Lum, Topological data analysis of Escherichia coli O157:H7 and non-O157 survival in soils, Front. Cell. Infect. Microbiol. 4 (2014), https://doi.org/10.3389/fcimb.2014.00122.

[12] M.E. Sardiu, J.M. Gilmore, B.D. Groppe, D. Herman, S.R. Ramisetty, Y. Cai, J. Jin, R.C. Conaway, J.W. Conaway, L. Florens, M.P. Washburn, Conserved abundance and topological features in chromatin-remodeling protein interaction networks, EMBO Rep. 16 (2015) 116–126, https://doi.org/10.15252/embr.201439403.

[13] M.E. Sardiu, J.M. Gilmore, B. Groppe, L. Florens, M.P. Washburn, Identification of topological network modules in perturbed protein interaction networks, Sci. Rep. 7 (2017) 43845, https://doi.org/10.1038/srep43845.

[14] J.M. Knight, L.A. Davidson, D. Herman, C.R. Martin, J.S. Goldsby, I.V. Ivanov, S.M. Donovan, R.S. Chapkin, Non-invasive analysis of intestinal development in preterm and term infants using RNA-Sequencing, Sci. Rep. 4 (2015), https://doi.org/10.1038/srep05453.

[15] J.M. Gilmore, M.E. Sardiu, B.D. Groppe, J.L. Thornton, X. Liu, G. Dayebgadoh, C.A. Banks, B.D. Slaughter, J.R. Unruh, J.L. Workman, et al., Wdr76 co-localizes with heterochromatin related proteins and rapidly responds to DNA damage, PLoS One 11 (2016) e0155492, https://doi.org/10.1371/journal.pone.0155492.

[16] P.G. Camara, D.I.S. Rosenbloom, K.J. Emmett, A.J. Levine, R. Rabadan, Topo-logical data analysis generates high-resolution, genome-wide maps of humanrecombination, Cell Syst. 3 (2016) 83e94, https://doi.org/10.1016/j.cels.2016.05.008.

[17] P.G. C�amara, A.J. Levine, R. Rabad�an, Inference of ancestral recombinationgraphs through topological data analysis, PLoS Comput. Biol. 12 (2016)e1005071, https://doi.org/10.1371/journal.pcbi.1005071.

[18] A.H. Rizvi, P.G. Camara, E.K. Kandror, T.J. Roberts, I. Schieren, T. Maniatis,R. Rabadan, Single-cell topological RNA-seq analysis reveals insights intocellular differentiation and development, Nat. Biotechnol. 35 (2017) 551e560,https://doi.org/10.1038/nbt.3854.

[19] M. Rucco, L. Falsetti, D. Herman, T. Petrossian, E. Merelli, C. Nitti, A. Salvi, Usingtopological data analysis for diagnosis pulmonary embolism, ArXiv Prepr.ArXiv14095020. https://arxiv.org/abs/1409.5020, 2014. (Accessed 18 July2017).

[20] G. Sarikonda, J. Pettus, S. Phatak, S. Sachithanantham, J.F. Miller, J.D. Wesley,E. Cadag, J. Chae, L. Ganesan, R. Mallios, S. Edelman, B. Peters, M. von Herrath,CD8 T-cell reactivity to islet antigens is unique to type 1 while CD4 T-cellreactivity exists in both type 1 and type 2 diabetes, J. Autoimmun. 50 (2014)77e82, https://doi.org/10.1016/j.jaut.2013.12.003.

[21] L. Li, W.-Y. Cheng, B.S. Glicksberg, O. Gottesman, R. Tamler, R. Chen,E.P. Bottinger, J.T. Dudley, Identification of type 2 diabetes subgroups throughtopological analysis of patient similarity, Sci. Transl. Med. 7 (2015), https://doi.org/10.1126/scitranslmed.aaa9364, 311ra174e311ra174.

[22] D. Romano, M. Nicolau, E.-M. Quintin, P.K. Mazaika, A.A. Lightbody, H. CodyHazlett, J. Piven, G. Carlsson, A.L. Reiss, Topological methods reveal high andlow functioning neuro-phenotypes within fragile X syndrome, Hum. BrainMapp. 35 (2014) 4904e4915, https://doi.org/10.1002/hbm.22521.

[23] T.S.C. Hinks, X. Zhou, K.J. Staples, B.D. Dimitrov, A. Manta, T. Petrossian,P.Y. Lum, C.G. Smith, J.A. Ward, P.H. Howarth, A.F. Walls, S.D. Gadola,R. Djukanovi�c, Innate and adaptive T cells in asthmatic patients: relationshipto severity and disease mechanisms, J. Allergy Clin. Immunol. 136 (2015)323e333, https://doi.org/10.1016/j.jaci.2015.01.014.

[24] T.S.C. Hinks, T. Brown, L.C.K. Lau, H. Rupani, C. Barber, S. Elliott, J.A. Ward,J. Ono, S. Ohta, K. Izuhara, R. Djukanovi�c, R.J. Kurukulaaratchy, A. Chauhan,P.H. Howarth, Multidimensional endotyping in patients with severe asthmareveals inflammatory heterogeneity in matrix metalloproteinases and

aging data sets with topological data analysis, Analytica Chimica Acta

Page 9: Analytica Chimica Acta - Cloud Object Storage€¦ · Each cluster is represented by its barycentre called centroid. As a first step, the algorithm selects at random k initial points

L. Duponchel / Analytica Chimica Acta xxx (2017) 1e9 9

chitinase 3elike protein 1, J. Allergy Clin. Immunol. 138 (2016) 61e75, https://doi.org/10.1016/j.jaci.2015.11.020.

[25] J. Bigler, M. Boedigheimer, J.P.R. Schofield, P.J. Skipp, J. Corfield, A. Rowe,A.R. Sousa, M. Timour, L. Twehues, X. Hu, G. Roberts, A.A. Welcher, W. Yu,D. Lefaudeux, B.D. Meulder, C. Auffray, K.F. Chung, I.M. Adcock, P.J. Sterk,R. Djukanovi�c, A severe asthma disease signature from gene expressionprofiling of peripheral blood from U-BIOPRED cohorts, Am. J. Respir. Crit. CareMed. 195 (2017) 1311e1320, https://doi.org/10.1164/rccm.201604-0866OC.

[26] B.Y. Torres, J.H.M. Oliveira, A.T. Tate, P. Rath, K. Cumnock, D.S. Schneider,Tracking resilience to infections by mapping disease space, PLoS Biol. 14(2016) e1002436, https://doi.org/10.1371/journal.pbio.1002436.

[27] A. Louie, K.H. Song, A. Hotson, A. Thomas Tate, D.S. Schneider, How manyparameters does it take to describe disease tolerance? PLoS Biol. 14 (2016)e1002435, https://doi.org/10.1371/journal.pbio.1002435.

[28] Y.-S. Ang, R.N. Rivas, A.J.S. Ribeiro, R. Srivas, J. Rivera, N.R. Stone, K. Pratt,T.M.A. Mohamed, J.-D. Fu, C.I. Spencer, N.D. Tippens, M. Li, A. Narasimha,E. Radzinsky, A.J. Moon-Grady, H. Yu, B.L. Pruitt, M.P. Snyder, D. Srivastava,Disease model of GATA4 mutation reveals transcription factor cooperativity inhuman cardiogenesis, Cell 167 (2016) 1734e1749, https://doi.org/10.1016/j.cell.2016.11.033 e22.

[29] S. Banerjee, T. Tian, Z. Wei, N. Shih, M.D. Feldman, J.C. Alwine, G. Coukos,E.S. Robertson, The ovarian cancer oncobiome, Oncotarget 8 (2017) 36225,https://doi.org/10.18632/oncotarget.16717.

[30] J.-K. Lee, J. Wang, J.K. Sa, E. Ladewig, H.-O. Lee, I.-H. Lee, H.J. Kang,D.S. Rosenbloom, P.G. Camara, Z. Liu, P. van Nieuwenhuizen, S.W. Jung,S.W. Choi, J. Kim, A. Chen, K.-T. Kim, S. Shin, Y.J. Seo, J.-M. Oh, Y.J. Shin, C.-K. Park, D.-S. Kong, H.J. Seol, A. Blumberg, J.-I. Lee, A. Iavarone, W.-Y. Park,R. Rabadan, D.-H. Nam, Spatiotemporal genomic architecture informs preci-sion oncology in glioblastoma, Nat. Genet. 49 (2017) 594e599, https://doi.org/10.1038/ng.3806.

[31] V. Pedoia, J. Haefeli, K. Morioka, H.-L. Teng, L. Nardo, R.B. Souza, A.R. Ferguson,S. Majumdar, MRI and biomechanics multidimensional data analysis reveals R2 -R 1r as an early predictor of cartilage lesion progression in knee osteoar-thritis: multidimensional Data Analysis to Study OA, J. Magn. Reson. Imaging(2017), https://doi.org/10.1002/jmri.25750.

[32] S. Banerjee, T. Tian, Z. Wei, K.N. Peck, N. Shih, A.A. Chalian, B.W. O'Malley,G.S. Weinstein, M.D. Feldman, J. Alwine, E.S. Robertson, Microbial signaturesassociated with oropharyngeal and oral squamous cell Carcinomas, Sci. Rep. 7(2017), https://doi.org/10.1038/s41598-017-03466-6.

[33] D. Nagy-Szakal, B.L. Williams, N. Mishra, X. Che, B. Lee, L. Bateman,N.G. Klimas, A.L. Komaroff, S. Levine, J.G. Montoya, D.L. Peterson, D. Ramanan,K. Jain, M.L. Eddy, M. Hornig, W.I. Lipkin, Fecal metagenomic profiles insubgroups of patients with myalgic encephalomyelitis/chronic fatigue syn-drome, Microbiome 5 (2017), https://doi.org/10.1186/s40168-017-0261-y.

[34] G. Singh, F. Memoli, T. Ishkhanov, G. Sapiro, G. Carlsson, D.L. Ringach, Topo-logical analysis of population activity in visual cortex, J. Vis. 8 (2008), https://doi.org/10.1167/8.8.11, 11e11.

[35] J.L. Nielson, J. Paquette, A.W. Liu, C.F. Guandique, C.A. Tovar, T. Inoue, K.-A. Irvine, J.C. Gensel, J. Kloke, T.C. Petrossian, P.Y. Lum, G.E. Carlsson,G.T. Manley, W. Young, M.S. Beattie, J.C. Bresnahan, A.R. Ferguson, Topologicaldata analysis for discovery in preclinical spinal cord injury and traumaticbrain injury, Nat. Commun. 6 (2015) 8581, https://doi.org/10.1038/ncomms9581.

[36] J.L. Nielson, S.R. Cooper, J.K. Yue, M.D. Sorani, T. Inoue, E.L. Yuh, P. Mukherjee,T.C. Petrossian, J. Paquette, P.Y. Lum, G.E. Carlsson, M.J. Vassar, H.F. Lingsma,W.A. Gordon, A.B. Valadka, D.O. Okonkwo, G.T. Manley, A.R. Ferguson, TRACK-TBI Investigators, Uncovering precision phenotype-biomarker associations intraumatic brain injury using topological data analysis, PLoS One 12 (2017)e0169490, https://doi.org/10.1371/journal.pone.0169490.

[37] A. Madan, J.C. Fowler, M.A. Patriquin, R. Salas, P.R. Baldwin, K.M. Velasquez,H. Viswanath, D.L. Molfese, C. Sharp, J.G. Allen, et al., A novel approach toidentifying a neuroimaging biomarker for patients with serious mental illness,J. Neuropsychiatry Clin. Neurosci. (2017), https://doi.org/10.1176/appi.neuropsych.16090174.

Please cite this article in press as: L. Duponchel, Exploring hyperspectral im(2017), https://doi.org/10.1016/j.aca.2017.11.029

[38] A. Savic, G. Toth, L. Duponchel, Topological data analysis (TDA) applied toreveal pedogenetic principles of European topsoil system, Sci. Total Environ.586 (2017) 1091e1100, https://doi.org/10.1016/j.scitotenv.2017.02.095.

[39] Y. Lee, S.D. Barthel, P. Dłotko, S.M. Moosavi, K. Hess, B. Smit, Quantifyingsimilarity of pore-geometry in nanoporous materials, Nat. Commun. 8 (2017)15396, https://doi.org/10.1038/ncomms15396.

[40] M. Offroy, L. Duponchel, Topological data analysis: a promising big dataexploration tool in biology, analytical chemistry and physical chemistry, Anal.Chim. Acta 910 (2016) 1e11, https://doi.org/10.1016/j.aca.2015.12.037.

[41] J.J. Andrew, M.A. Browne, I.E. Clark, T.M. Hancewicz, A.J. Millichope, Ramanimaging of emulsion systems, Appl. Spectrosc. 52 (1998) 790e796.

[42] A. de Juan, M. Maeder, T. Hancewicz, R. Tauler, Use of local rank-based spatialinformation for resolution of spectroscopic images, J. Chemom. 22 (2008)291e298, https://doi.org/10.1002/cem.1099.

[43] F. Jamme, L. Duponchel, Neighbouring pixel data augmentation: a simple wayto fuse spectral and spatial information for hyperspectral imaging dataanalysis, J. Chemom. 31 (2017), https://doi.org/10.1002/cem.2882 n/a-n/a.

[44] R. Sibson, SLINK: an optimally efficient algorithm for the single-link clustermethod, Comput. J. 16 (1973) 30e34, https://doi.org/10.1093/comjnl/16.1.30.

[45] J. MacQueen, Some methods for classification and analysis of multivariateobservations, in: Proceedings of the Fifth Berkeley Symposium on Mathe-matical Statistics and Probability, vol. 1, 1967. Statistics.

[46] S. Lloyd, Least squares quantization in PCM, IEEE Trans. Inf. Theory 28 (1982)129e137, https://doi.org/10.1109/TIT.1982.1056489.

[47] J.C. Dunn, A fuzzy relative of the ISODATA process and its use in detectingcompact well-separated clusters, J. Cybern. 3 (1973) 32e57, https://doi.org/10.1080/01969727308546046.

[48] D.L. Davies, D.W. Bouldin, A cluster separation measure, IEEE Trans. PatternAnal. Mach. Intell. 1 (1979) 224e227.

[49] P.J. Rousseeuw, Silhouettes: a graphical aid to the interpretation and valida-tion of cluster analysis, J. Comput. Appl. Math. 20 (1987) 53e65, https://doi.org/10.1016/0377-0427(87)90125-7.

[50] M. Halkidi, Y. Batistakis, M. Vazirgiannis, On clustering validation techniques,J. Intell. Inf. Syst. 17 (2001) 107e145, https://doi.org/10.1023/A:1012801612483.

[51] Y. Xie, V.V. Raghavan, P. Dhatric, X. Zhao, A new fuzzy clustering algorithm foroptimally finding granular prototypes, Int. J. Approx. Reason 40 (2005)109e124, https://doi.org/10.1016/j.ijar.2004.11.002.

[52] S. Bandyopadhyay, S. Saha, A point symmetry-based clustering technique forautomatic evolution of clusters, IEEE Trans. Knowl. Data Eng. 20 (2008)1441e1457, https://doi.org/10.1109/TKDE.2008.79.

[53] L. Vendramin, R.J.G.B. Campello, E.R. Hruschka, Relative clustering validitycriteria: a comparative overview, Stat. Anal. Data Min. (2010), https://doi.org/10.1002/sam.10080.

[54] I. Gurrutxaga, I. Albisua, O. Arbelaitz, J.I. Martín, J. Muguerza, J.M. P�erez,I. Perona, SEP/COP: an efficient method to find the best partition in hierar-chical clustering based on a new cluster validity index, Pattern Recognit. 43(2010) 3364e3373, https://doi.org/10.1016/j.patcog.2010.04.021.

[55] K.R. �Zalik, B. �Zalik, Validity index for clusters of different sizes and densities,Pattern Recognit. Lett. 32 (2011) 221e234, https://doi.org/10.1016/j.patrec.2010.08.007.

[56] M.K. Pakhira, Finding number of clusters before finding clusters, ProcediaTechnol. 4 (2012) 27e37, https://doi.org/10.1016/j.protcy.2012.05.004.

[57] O. Arbelaitz, I. Gurrutxaga, J. Muguerza, J.M. P�erez, I. Perona, An extensivecomparative study of cluster validity indices, Pattern Recognit. 46 (2013)243e256, https://doi.org/10.1016/j.patcog.2012.07.021.

[58] D.H. Wolpert, W.G. Macready, Coevolutionary free lunches, IEEE Trans. Evol.Comput. 9 (2005) 721e735, https://doi.org/10.1109/TEVC.2005.856205.

[59] D.H. Wolpert, W.G. Macready, No free lunch theorems for optimization, IEEETrans. Evol. Comput. 1 (1997) 67e82, https://doi.org/10.1109/4235.585893.

[60] D.H. Wolpert, The lack of a priori distinctions between learning algorithms,Neural Comput. 8 (1996) 1341e1390, https://doi.org/10.1162/neco.1996.8.7.1341.

aging data sets with topological data analysis, Analytica Chimica Acta