Chem a Eur J ATW Full Paper

download Chem a Eur J ATW Full Paper

of 11

Transcript of Chem a Eur J ATW Full Paper

  • 8/14/2019 Chem a Eur J ATW Full Paper

    1/11

    1

    FULL PAPER

    DOI: 10.1002/chem.2008())

    Design of a Full-Profile Matching Solution for High-Throughput Analysis of Multi-Phases Samples Through Powder X-Ray Diffraction

    Laurent A. Baumes*, Manuel Moliner and Avelino Corma

    *[a]

    Abstract: Few solutions that aim atidentifying crystalline materials fromthe analysis of powder X-ray

    diffraction (XRD) data have beenreported so far. A careful inspection ofthe latter and the correspondinghighlight of specific failures when usedfor the discrimination ofcrystallographic phases of zeolitesamong mixtures, has allowed thecreation of the recently proposedstrategy called Adaptable TimeWarping (ATW). Here, the designprocess is thoroughly detailed in a step by step manner allowing a deepunderstanding of the motivations,

    improvements, and the resultingremarkable properties of ourmethodology. As the use of high-

    throughput techniques for the discoveryor the synthesis space enlargement ofnew microporous crystalline structuresmakes the trustworthiness of search-match methods a critical point to beassessed, a meticulous evaluation of thereliability and the robustness is provided and supported by bothempirical comparisons andmathematical proofs. The resultsoffered by our methodology, as itclearly outperforms the well-established solutions, open the way

    toward a total automation of such aroutine procedure, eliminatinglaborious and time-consuming controls,

    preliminary treatments and settings.Consequently, the proposed solution isof great interest and appears to be very promising considering the numerous potential applications of X-raydiffraction in material science, but alsothe possible extent to several othercharacterization techniques.

    Keywords: x-ray diffraction fullprofile warping distance mixture high-throughput

    Introduction

    The knowledge of the structure at atomistic/molecular level of a

    material is required for any advanced understanding of its

    performance. In order to have access to the material micro-structure,

    numerous generic characterization tools exist, each of them

    providing very specific information either on the structure or the

    chemical properties, but they are time-consuming. While most of the

    methods are performed ex-situ, in-situ characterization techniques

    are developing rapidly, with a major effort for parallelization and

    high-throughput[1] (HT). Nevertheless HT characterization allows

    saving machine time, the corresponding high volume of data

    generated requires an automatic data analysis system to avoid

    slowing down the whole process. With respect to the targeted

    information, most of the responses take the form of series where the

    measurements follow an ordered variable; for example, temperature

    (thermogravimetry analysis e.g. TGA, temperature programmed

    desorption-reduction-oxidation, e.g. TPD, TPR, TPO), the time (gas

    chromatography, e.g. GC), the wave length (photoluminescence,

    infra-red, e.g. IR, ultraviolet-visible adsorption, e.g. UV-vis), the

    voltage (voltammetry), or the angles (x-ray diffraction, e.g. XRD),

    etc. Here, we focus on powder diffraction data to be automatically

    labeled by their corresponding phases (single or mixture). Moreover,

    such analysis is intended not only to identify crystalline materials by

    comparing diffraction data against a database, but also to detect the

    formation of an unknown phase. Few solutions[2,3,4] that aim at

    extracting phases ID directly from powder X-ray diffraction data

    have been reported so far. Despite the excellent results[5,6] reported

    for these techniques, their application on mixtures of zeolite

    crystallographic phases gave us motivations to further investigatetheir reliability and robustness. A careful inspection of the latter and

    the corresponding highlight of specific/representative failures

    brought us to create a new technique called Adaptable Time

    Warping (ATW).[7] Here, the design process of the recently

    proposed ATW strategy is thoroughly detailed in a step by step

    manner. This allows a deep understanding of the motivations,

    improvements, and the resulting remarkable properties of the

    methodology.

    Three different studies and the corresponding results using the

    commercial software PolySnap2[3,4] from Bruker-AXS are

    presented in the first part of thus report. Then, a close examination

    of the responses will allow to point and understand the weaknessesof such an approach for the type of problems under study. This will

    be illustrated by the selection of few representative mistakes for

    [a] Dr. L. A. Baumes, Dr. M. Moliner, Dr. Prof. A. Corma

    Instituto de Tecnologia Quimica

    CSIC-Universidad Politecnica de Valencia

    UPV, Av. de los Naranjos, s/n E-46022 Spain

    Fax: (+34) 963879444

    E-mail: {baumesl; mmoliner; acorma}@itq.upv.es

    Supporting information for this article is available on the WWW under

    http://www.chemeurj.org/

  • 8/14/2019 Chem a Eur J ATW Full Paper

    2/11

    2

    which some corrective adaptations of the methodology are described

    in the third section. It is shown how such modifications can be

    merged together making the so-called Adaptable Time Warping a

    fundamentally different, powerful and innovative approach. Its

    evaluation on numerous benchmarks allows supporting empirically

    the mathematical proofs of ATW robustness and reliability property.

    Finally, results using ATW for the same cases will show the clear

    superiority of this methodology with respect to those previously

    reported.

    Figure 1. The automatic analysis of the 192 X-Ray diffraction data of the discovery

    program of the zeolite ITQ-33 is expected to identify the different crystallographic

    phases present in each sample, allowing the creation of the corresponding phase

    diagram (8 different phases, numerous mixtures, and among them the presence of a new

    material has to be established).

    Identification of Zeolite Crystallographic Phases

    Zeolites are crystalline alumino-silicates, whose structures are

    3D networks composed of channels and cavities of molecular

    dimensions. Their microporous framework structures, the wide

    range of chemical composition and surface acidity, and the

    possibility of tuning their electric fields are key factors that render

    zeolites versatile materials for an increasing number of

    applications.[8]

    Among them, is the use of zeolites as catalysts (for the

    petrochemical industry, for pollution control and for the synthesis of

    special chemicals), for gas adsorption and separation, electronics

    and medical uses.[9] The discovery of new structures or enlarging the

    synthesis space, and optimization of existing ones require a

    considerable experimental effort. This can be diminished by using

    high throughput (HT) technology (usually parallelization and

    miniaturization)[10]

    as also done for heterogeneous catalysts.[11]

    As arecent example, this technique has allowed the synthesis of a very

    unique zeolite structure that includes extra-large 18MR connected

    with medium 10MR pores (ITQ-33).[12] The unusual conditions

  • 8/14/2019 Chem a Eur J ATW Full Paper

    3/11

    3

    needed to synthesize this new material were found by using HT

    techniques to cover a wide range of possible synthesis conditions.

    That study required the generation of 192 diffractograms to follow

    the formation of the different crystallographic phases and mixtures

    of phases depending on the synthesis conditions, and from those

    mixtures of phases the presence of a new zeolite structure had to be

    established, see Figure 1. To achieve this is not obvious when

    complex mixtures of several crystalline phases exist in the same

    samples and consequently, the identification of the phases is time

    consuming. At the same time when increasing the amount of

    experiments to cover a broader synthesis range, it becomes

    mandatory to develop an automatic data treatment not to slow down

    the whole process.[13] Therefore an important step in this direction is

    the automatic classification of crystalline structures through XRD

    data, as it has been previously pointed by Takeushi et al.[6] For doing

    this PolySnap2[3] has been selected as a reference because the

    software is representative and highly relevant considering the

    current state of the art in HT identification of phases through full

    profile examination. Detailed explanations are available, while

    merging various techniques: principle component analysis (PCA),

    multi-dimensional scaling (MDS), cluster analysis,

    [14]

    and statisticaltests. As PolySnap2 offers an automatic analysis where user

    interface is minimized to few options (allows x-shift and check

    for amorphous have been selected), this mode is chosen for

    comparison. Three datasets of experiments have been selected:

    Zeolite ,[10b] ITQ-21/30,[10a,10d] and ITQ-33[12].

    Zeolite - In order to optimise the preparation conditions for

    the synthesis of beta zeolite, a factorial design (3242) is employed

    for exploring different molar gel composition. The molar gel

    composition is explored by varying the following molar ratios:

    Na/(Si+Al), TEA/(Si+Al), OH/(Si+Al) and H2O/(Si+Al), while

    Si/Al ratio is fixed. The experimental design considers the following

    four molar ratios (level): Na/(Si+Al) (3) ranging from 0 to 0.5;TEA/(Si+Al) (4) from 0.1 to 0.6; OH-/(Si+Al) (4) from 0.15 to 0.52;

    and H2O/(Si+Al) (3) from 5 to 15. The total number of samples

    synthesised considering this factorial design is 144. The materials

    obtained in the studied area are amorphous materials, zeolite beta,

    and another dense material (named as UDM).

    ITQ-21/30 - The gel composition is explored by varying the

    following molar ratios: Al/(Si+Ge), MSPT/(Si+Ge), F-/(Si+Ge) and

    Si/Ge. A factorial experimental design (4 x 32 x 22 =144) is selected

    for studying simultaneously the crystallization time and the

    composition of the gel varying the molar ratios of the components.

    Two zeolites are found in the area of the phase diagram studied:

    ITQ-21 and ITQ-30.

    ITQ-33 - Hexamethonium is used as structure directing agent

    (SDA). An initial experimental factorial design (343) is selected.

    Si/Ge, TIII/(Si+Ge), OH-/(Si+Ge), and H2O/(Si+Ge) are the

    synthesis variables. This experimental design considers the

    following four molar ratios (level): Si/Ge (4) ranging from 2 to 30;

    B/(Si+Ge) (4) from 0 to 0.05; OH-/(Si+Ge) (3) from 0.1 to 0.5; and

    H2O/(Si+Ge) (4) from 5 to 30. The number of synthesized samples

    is 192. Hexamethonium is a promising structure directing agent

    (SDA), since this molecule has a high number of degrees of freedom.

    Such flexibility allows different conformations that stabilize several

    competing structures, such as EU-1, ITQ-17, ITQ-22, ITQ-24, SSZ-

    31, a lamellar phase, and the new structure, ITQ-33.

    When PolySnap2 is employed for the studies ITQ-33,

    Zeolite , and ITQ-21/30 the overall recognition rates are

    68.2%, 84.3%, 86.11% and respectively. Despite the relative

    easiness of study and ITQ-21/30, i.e. only three different phases

    (including amorphous) are present with their corresponding

    mixtures, 15.7% and 13.89% of errors are respectively obtained.

    Among them, the majority corresponds to the fact that partially

    crystalline materials are not correctly identified. The same

    observation can be done for ITQ-33 study. This makes the number

    of false negative, i.e. the presence of a given crystallographic phase

    which is not detected, relatively high, see the first line of the Tables

    1 and 2. Confusion matrix for ITQ-21/30 study is given in

    Supplementary information as Table S2. Among them, growing

    phases represent the principal source of misidentifications, and it

    can be observed that mixtures of pure phases (without the presence

    of amorphous) are also poorly recognized.

    Table 1. Confusion matrix with the real phases in the experimental design Beta study

    versus predicted classes obtained with PolySnap2 indicated at the top of the cell, and

    versus ATW indicated at the bottom of the cell (in italic and between brackets).

    Considering the selected applications, an adapted methodology

    is still lacking taking into account that one specific structure can

    present large differences in peak intensity and diffraction angle

    within the XRD diffractogram, depending on the level of

    crystallinity, crystallite size, and chemical composition.

    Subsequently, the synthesized powder which presents a mixture of

    phases and impurities makes the automatic determination of

    crystalline phases through X-ray diffraction data a real challenge. A

    closer look on mistakes should allow understanding why the

    methodology is failing. Therefore, some representative failures are

    extracted from the above treatments, see Figures 1 to 3.

    PolySnap2 operations[4] can be summarized as follows:

    i) each input sample is checked against a database of known phases,

    ii) the criterion based on the Pearson and Spearman correlation

    Real phases

    Amorphous

    UDM

    GrowingUDM

    Beta

    GrowingBeta

    Mixture

    GrowingMixture

    Am.75

    (76)

    1 9

    (1)

    2 2 89

    (77)

    UDM11

    (12)

    2

    (1)

    13

    (13)

    Grow

    UDM

    0

    (9)

    0

    (9)

    Beta27

    (28)

    2 29

    (28)

    Growing

    Beta

    0

    (1)

    0

    (2)

    0

    (3)

    Mixture1 2 0

    (1)

    3

    (1)

    Growing

    Mixture

    0

    (1)

    0

    (2)

    0

    (3)

    Predictedphases

    76 12 11 29 2 2 2 84.3%

    (97%)

  • 8/14/2019 Chem a Eur J ATW Full Paper

    4/11

    4

    coefficients is calculated taking into account all measured data

    points (full profile), iii) if the pattern correlations do not indicate a

    good match, then the percentage composition of the sample is

    determined, i.e. quantitative analysis, using the singular value

    decomposition (SVD) for matrix inversion, and all the known

    phases as potential elements of the composition.

    Table 2. Confusion matrix with the real phases in the experimental design ITQ-33

    versus predicted classes obtained with PolySnap2 indicated on the left hand-side of

    the cell, and versus ATW indicated on the right hand-side of the cell (in italic andbetween brackets).

    [a] ITQ-24 + amorphous. [b] SSZ-31 + amorphous.

    Figures 2-a ,3-b, and 2-c highlight a first problem inherent to

    methodologies for which a statistical criterion based on a correlation

    coefficient is employed to measure the strength (and direction) of

    the relationship between diffractograms. As highest peaks get the

    greatest influence on the variation of the criteria, the local

    overlapping of diffractograms considering the reflections with high

    intensities, such as for ITQ-22 and ITQ-24 at 11 and between 20

    and 23, see Figure 2-a, can mislead the methodology when used in

    the case of ITQ-33. The same confusion appears for the sample

    containing both zeolite Beta and UDM in Figure 3-b. In this case,

    smaller peaks belonging to UDM are clearly visible, but their

    influence remains relatively too low compared to the main phase.

    This problem is tackled in PolySnap2 as it integrates various

    techniques and criteria in order to better handle such kind of

    complications. Gilmore et al. define a general criterion as the linear

    combination (half-half for the automatic mode) of Pearson and

    Spearman coefficients.[3] Spearmans coefficient is a non-parametric

    measure of correlation and a special case of Pearson product-

    moment as the data is first converted to ranking before calculating

    the coefficient. This has the advantage to decrease the influence of

    the nominal peaks height. However, the criterion is not strongenough by itself. When combined with Pearson coefficient, it simply

    behaves as an indicator for further user attention.

    Figure 2. Predicted errors by using PolySnap2. a) The real phase is ITQ-22 (#19),

    being the software prediction mixture ITQ-22 with ITQ-24, b) it is growing the

    Lamellar phase, being the software prediction amorphous, and c) it predicts pure ITQ-

    17 (#217), being a mixture between ITQ-17 and ITQ-24.

    Predicted phases

    Othermixtures

    22+24+33

    22+24

    17+24

    24+33

    Lamellar+24

    ITQ-17

    Lamellar

    SSZ-31

    EU-1

    ITQ-24

    ITQ-22

    Amorphous

    55

    55(55)

    Am.

    38

    (2)

    36(36)

    2

    ITQ-22

    19

    6(2)[a]

    8(17)

    2 3ITQ-24

    3 1(3)

    2EU-1

    16

    (1)[b]

    13(15)

    3

    SSZ-31

    46

    18(46)

    28

    Lam.

    0 0(0

    )

    ITQ-17

    8 3 0(8)

    1 2 2

    Lam.+24

    2 2 0(2)

    24+33

    2 1 0(2)

    1

    17+24

    2 1 0(2)

    1

    22+24

    1 0(1)

    1

    22+24+33

    131/192=68.2%

    (187/192=97.4%)

    13(3)

    0(1)

    0(4)

    0(2)

    0(2)

    0(8)

    1(0)

    19(46)

    13(14)

    1(2)

    10(17)

    40(36)

    95(54)

    Realphases

  • 8/14/2019 Chem a Eur J ATW Full Paper

    5/11

    5

    Similarly, Figures 2-b and 3-a point the problem of detecting a

    growing phase in presence of amorphous. As explained in Ref[3], if

    the ratio of non-background to background intensity falls below a

    default limit of 3% and if, additionally, there are less than a default

    value of 3 peaks, the sample is flagged as non-crystalline. Here, the

    conception of the methodology consisting in an a priori separation

    of the different situations, i.e. one pure phase-mixture of phases-

    amorphous sample, through thresholds explains such errors.

    Another intricacy encountered during the analysis of ITQ-33

    data is represented in Figure 4. It can be observed how the

    increasing integration of Germanium modifies the dimension of the

    unit cell due to [Ge-O] longer bond length than [Si-O], and

    consequently shifts the diffractograms along x-axis. The difficulty

    of such situation arises from the fact that the shifting is not constant

    along the entire angle range as emphasized in Figure 4-b. This has a

    direct influence on the criteria and thus on the identification as most

    techniques use the Euclidian distance at the basis of their calculation,

    i.e. the differences of intensity are calculated at the same angle

    positions. The resulting is that most of these diffractograms are

    interpreted as mixtures of ITQ-24 and another phase, i.e. falsepositive.

    Figure 3. Predicted errors by using PolySnap2. a) Growing phases, UDM (#21) and

    Beta (#45), being the software prediction amorphous. b) Mixtures between Beta and

    UDM, being the software prediction Beta phase.

    Figure 4. XRD patterns for ITQ-24 zeolites obtained in the ITQ-33 study with

    different Si/Ge ratio a) full pattern, and b) small angle range (19-24).

    To summarize, when facing difficult problems where samples

    present large differences in the intensity of peaks and diffraction

    angles, the traditional methodologies obviously fail. As basic

    statistical correlation coefficients allow the identification of pure

    phases in most of the cases, except for the latter, the real strength

    and need of new strategies is expected where the current

    methodologies does not succeed or simply warns for further user

    attention. Such tricky situations can be divided into 5 groups: i) the

    treatment of the highest peaks which either mislead the

    crystallographic phase assignment due to local overlapping, Figure

    2-a, or hide the formation of competitive phases due to their

    relatively big weight into the criterion, Figure 2-c; ii) the

    simultaneous occurrence of small peaks representative of a given

    growing phase does not take enough importance among the evident

    presence of another phase, Figure 3-b; iii) the treatment of partially

    but predominantly amorphous samples, Figures 2-b and 3-a, is not

    accurate enough leading to numerous false negatives; iv) similarly,

    the shifting of diffractograms along angle axis, Figure 4,

    undoubtedly appears as one of the limitation of Euclidian-based

    criteria leading to uncertainty; iv) finally the effective detection of

    the presence of a new phase, or zeolite structure, which remains oneof the most required feature of the strategy still needs improvements.

    The three samples containing the growing ITQ-33 zeolite have been

    classified as mixtures of other pure phases, Table 2.

    Despite the fact that other full-profile techniques are available

    either in PolySnap2 or from other sources, their overall conception

    remains very similar and always refers to statistical criteria which

    determine the similarity between the diffractograms. Such similarity

    is used either for a direct labelling of the samples as shown before,

    or for the visual representation of numerous XRD in a way to

    provide a quick idea of how diffractograms may be grouped,[2,14]

    such as principle component analysis (PCA), multi-dimensional

    scaling (MDS), cluster analysis. However, these traditionalalgorithms initially created for multi-dimensional data are not

    adequate when series are considered. The reason for this is mainly

    due to the lack of knowledge on the intrinsic ordering data

    particularity. Most of them are similarity/dissimilarity-based

    approaches, where the similarity between instances is usually

    explicitly expressed by the classical p-norm distance, e.g. Euclidian.

    Numerous algorithms are available, but only few distances have

    been proposed for the special treatment of series and, among them,

    the Dynamic Time Warping[15] (DTW) appears as the most famous

    one. As warping distances have never been integrated into search-

    match approaches, we propose this first modification to tackle the

    problem related to shifting. On the other hand, even though all

    points are considered in full-profile search-match methodologies,the whole available information is not fully used. Before analyzing

    unknown samples, the selection of potential pure phases can be a

    priori analyzed, the information being contained in these patterns

    should give precious indications on how the different phases can be

    identified taking into account the presence of other ones. Such kind

    of analysis can be handled by machine learning (ML) techniques.

    The aim is to localize strategic traits/features of each individual

    phase which makes them unique taking into account the competitive

    phases. Moreover, additional information can be inserted before

    launching the automatic phase recognition, such as previous

    experiments, i.e. the use of internal databases, o theoretical

    diffraction data simulating the formation of unsynthesized mixtures

    or the integration of new elements into the crystal network, such asGermanium considering ITQ-33 dataset.

  • 8/14/2019 Chem a Eur J ATW Full Paper

    6/11

  • 8/14/2019 Chem a Eur J ATW Full Paper

    7/11

    7

    crystallographic phases, previous experiments, and theoretical

    calculations. The aim of analyzing additional data is to modify the

    similarity between diffractograms in order to improve recognition of

    mixtures, with or without the presence of amorphous. A way to

    achieve this is to modify the influence of the intensities for each

    angle position. Since the matrix which is employed for DTW

    calculation is a squared matrix of dimension t, i.e. the total number

    of angles steps, it can be used to integrate weights. A possible way

    to correctly assign the values of each weight is to learn how

    intensities must be interpreted or modified in order to get a correct

    answer from the labelling procedure. Due to the large number of

    angle positions, a machine learning (ML) approach must be

    employed. By presenting various cases (better if complicated) either

    from past experiments or theoretical calculations, we expect the

    setting of weights to improve the classification results.

    Our methodology Adaptable Time Warping (ATW) is designed,

    as all warping (including the famous DTW), and p-norm distances

    are particular cases of ATW. Let a tt matrix, be the set of

    parameters required to compute ATW,+ = i j

    [ ], 1, i j t . If ij=, ai is not allowed to be matched with bj.A(Ai, Bj, ij) is the recursive function that computes the distancebetweenAi andBj. ATW is defined by Equation 2.

    Considering a set of n time series and a given classification

    algorithm, we expect optimal settings for ATW distance consideringthe given problem and related data. Two different algorithms are

    required: a classification technique, and another one for the

    automatic weight setting. As an example and in order not to make

    the methodology too complex, the instance-based learning (IBL)

    method called k-nearest-neighbors (k-nn) is chosen as classification

    strategy. An IBL approach also called memory-based or case-based

    learning is defined as the generalizing of a new instance to be

    classified from the stored training examples. Training examples are

    processed when a new instance arrives. Each time a new query

    instance is encountered, its relationship to the previously stored

    examples is examined to assign a target function value for the new

    instance. Three characteristics may define IBL algorithms: i) A

    similarity function which tells the algorithm how close together two

    instances are. ii) A typical instance selection function: this tells

    the algorithm which of the instances to keep as examples. iii) A

    classification function which relates a new case to the learned cases.

    IBL approaches are chosen since the similarity function that is

    employed plays a central role as stressed by the definition given

    earlier. The following are the most common IBL methods: k-nearest

    neighbors (k-nn), locally weighted regression, and radial basis

    function (RBF). k-nn algorithm is selected for simplicity since it is

    the most basic of all IBL methods, but also because the aim is to

    stress on the proposed similarity measure and not on the algorithm

    which uses it. Moreover, superior performances may be easily

    obtained as k-nn algorithm can be refined by weighting the

    contribution of each of the kneighbors according to their distance to

    the query point giving greater weight to closer neighbors such as for

    RBF. One very important trait of the methodology is its

    independence from the classification algorithm, i.e. others could be

    employed as well.

    At this level of the design, it remains defining the procedure for

    weight assignment. Since the aim is to improve labelling results, the

    best matrix is defined as the matrix that minimizes the

    classification error rate with ATW and the chosen learning strategy,

    see Figure 7. The brute force solution would perform theclassification with all the possible values of , and finally keeps the

    one that corresponds to the best error rate. Each possible matrix

    must be tested for obtaining the error rate. However, the resulting

    complexity in time of the exhaustive search is intractable. This

    implies the use of a heuristic which aims at quickly finding a

    solution, i.e. obtaining a locally optimal matrix . Genetic

    algorithms (GAs) are selected because there is a body of theory

    providing theoretical evidence to complement the empirical proof of

    their robustness. Such techniques are now consistently used. The

    reader is referred to Ref[26] for an introduction and advanced

    discussions.

    Figure 7. ATW conception. Based on classification errors of training samples, the

    matrix is modified until reaching a given termination criterion, i.e. the convergence to a

    minimum. Finally the best matrix is used for unseen test samples.

    Let X be a set of n time series, X = {x1,,xn), and a

    population ofp individuals {1,, p). Each individual is a

    candidate matrixi that owns t variables. The terms individuals,

    population, etc are used following the GA terminology. Each

    cell of the matrix is a variable that the GA optimizes. The GA (i.e.

    all individualsi) is initialized by assigning random values to every

    cell ij. The evaluation, then the selection, and finally the

    mutation/recombination are applied to each generation. The details

    about the use of genetic algorithm as optimization tool are explained

    in supplementary information. Our approach is a learning process

    which uses a heuristic, and allows reaching near-optimal solutions.

    According to a given classification problem, ATW approach adapts

    itself for better capturing data specificity. ATW can be associated to both warping and non-warping distances as demonstrated by the

    proof given in supplementary material. Therefore, whatever the

    temporal classification problem, optimal parameters imply ATW

    ( ) ( )

    ( )

    ( )

    ( )( )

    ' '

    ' '

    ' '

    1 ( 1)

    ' '

    1 ( 1)

    ' '

    1 ( 1)

    ' '

    1

    , , , , (Equation 2)

    , ,

    0, 1

    , , , 1 1

    , , , 1 1

    , ,

    min , ,

    =

    =

    = =

    = >

    > == +

    A t t t t

    A i j i j

    i j

    A i j i j

    A i j i j

    i j i j

    A i j i j

    A i j i

    ATW A B A B

    with A B

    if

    if i j

    A B if i and j

    A B if i and j

    a bA B

    A B

    ( )

    ( )

    [ ]

    ( 1)

    ' '

    1 1 ( 1) ( 1)

    11 1

    1

    ,

    , ,

    , , 1,

    +

    =

    j

    A i j i j

    t

    i j

    t t t

    else

    A B

    with i j t

  • 8/14/2019 Chem a Eur J ATW Full Paper

    8/11

    8

    to give results at least equal or superior to other distances, and

    consequently ATW is always at least equivalent or superior to all

    other distance use. In the following section, results for 20

    benchmarks[27] and the three real datasets of zeolite investigations

    are given, confirming empirically the mathematical proof.

    Results and Discussion

    The k-nn algorithm is employed with k=3. The calculation of the

    error rate on the learning dataset is done using k-CV with k=5.

    The lowest error rate on the validation set ( i.e. the winner matrix) is

    kept for comparison with other techniques which have been

    evaluated with the same validation set. The number of iteration (r)

    for ATW matrices optimization is set to 50, and the population size

    (p) is set to 4.pmutand q are respectively set to {0.1; 5}.

    Figure 8 and Table S1 in supplementary material shows that

    ATW always constitutes the lower bound of error rates when is

    compared to both DTW and Euclidian distance use on the 20

    different benchmarks.[27] ATW is superior to DTW (either with the

    best delta value or without warping window) and the Euclidiandistance for 7 benchmarks (see bold values).

    Figure 8. Evaluation of ATW of 20 different benchmarks. [27] For all the cases, ATW

    reaches the lowest classification error indicated on the y-axis.

    Considering zeolite dataset, DTW performs nearly two

    times better than the Euclidian distance but here again ATW allows

    to reach a better performance. For ITQ-21/30, DTW and

    Euclidian distances perform equally while ATW decreases the error

    rate by half. In those datasets there are only few different classes

    with low number of mixtures between them. However, in the zeolite

    synthesis, it is normal to crystallise a high quantity of different

    materials, with mixtures of two or more different zeolites. The ATW

    methodology can not only predict the pure phase but also the

    mixture ordered by decreasing degree of crystallization of each one.

    For doing that, the algorithm compares to next neighbours getting

    the distance to them, and the output phase is ordered depending on

    that distance. Moreover, depending on the crystal composition, it is

    possible to obtain some shifts in the peak positions, although it is the

    same phase. Therefore, one advantage of ATW is to consider peak

    position shifts, which are common for zeolites with different Si/Ge

    ratio. For this reason, the dataset ITQ-33 is an excellent proof to

    check the ability of the ATW to classify such materials. There are

    eight different materials (amorphous, ITQ-17, ITQ-22, ITQ-24, EU-

    1, Lamellar phase, SSZ-31, and ITQ-33) with numerous mixtures

    between them. The ratio Si/Ge is also varied in the study getting

    variations in the peaks positions. The last challenge that shows the

    experimental design is the discovering of a new crystalline phase,

    ITQ-33, in a very narrow condition range. In Figure 1 it can be

    observed the occurrence of each competing phase as a function of

    the composition of the starting gel. Despite of the non-linearity of

    the system, it is clearly defined the range of composition in which a

    specific phase is formed. However, there are some narrow ranges of

    composition in which two or more phases are competing. Zeolite

    ITQ-33 was growing with ITQ-24 in an specific area of the phase

    diagram. The conditions for ITQ-33 synthesis are extremely unusual,

    due to the low OH-/(Si+Ge) ratio in a concentrated gel H2O/(Si+Ge)

    = 5, in the presence of a trivalent element ((Si+Ge)/B = 20 or 50)

    and germanium (Si/Ge = 2). When some variable was changed, the

    ITQ-33 phase disappears and ITQ-22 or ITQ-24 are formed. WhenATW is applied, the classification error is below 4% (see Table 2),

    considering all the mixtures obtained in the experimental design.

    Moreover, if the confusion matrix is studied in detail, one can see

    that 2 of the 5 errors are due to the fact that the algorithm predictsalso the amorphous materials that are contained in these samples,

    being more accurate than our manual classification. Amorphous

    material is considered as a regular phase, even when the amorphous

    content is impurity. Two other errors come from two predicted

    mixtures ITQ-22/ITQ-24 while the real phases are pure ITQ-22. In

    fact, ITQ-24 was in the limit of significance, being a most minority

    phase, and the SVD procedure assigns less than 3%.

    Conclusion

    ATW is a new time series distance that can be adapted

    according to the classification issue, using a genetic algorithm. GA

    components (chromosomes and operators) have been adapted for the

    task in-hand. Such hybridization is shown of great interest for the

    classification goal while the adaptation of the GA to the problem

    0

    0.1

    0.2

    0.3

    0.4

    0.5

    0.6

    SyntheticControl

    Gun-Point

    CBF

    Face(all)

    OSULeaf

    SwedishLeaf

    50Words

    Trace

    TwoPatterns

    Wafer

    Face(four)

    Lightning-2

    Lightning-7

    ECG

    Adiac

    Yoga

    Fish

    Beef

    Coffee

    OliveOil

    Euclide DTW(best d value)

    DTW (no warping window) ATW

  • 8/14/2019 Chem a Eur J ATW Full Paper

    9/11

    9

    form allows applying successfully the new strategy to complex

    datasets. Besides its obvious identification superiority, a large

    number of advantages are inherent to ATW such as the following:

    there is no background removal, no peak extraction, no smoothing,

    no interpolation, no region exclusion from the user, no pre-defined

    shifts, no false negative, the best is automatically found, it handles

    properly amorphous samples, captures phases characteristics while

    reducing the user interaction to the minimum, and finally encounters

    new crystallographic phases.

    Figure 9. Coupling of ATW inside our platform hITeQ with ITQ existing HT tools

    ATW clearly appears as a new method for automatically

    deciphering X-ray diffraction patterns, allowing crystallographic

    phases to be identified quickly and reliably. As XRD is used for a

    wide range of purposes, from routine characterization in industrial

    procedure control through in depth research investigations of the

    most complex high technology materials, such kind of methodology

    which can be easily coupled with existing high-throughput synthesis

    apparatus, see Figure 9, is a real breakthrough. It has been shown

    how the phase diagram corresponding to a broad research space can

    be quickly revealed making clearer the way towards new structures.

    Acknowledgements

    EU Commission FP6 (TOPCOMBI Project) is gratefully acknowledged. We also thank

    Santiago Jimenez for his important collaboration to set the hITeQ platform which

    integrates all programming codes of the presented methodology. Laurent A. Baumes

    thanks Stephan Schunk from hte Company for his comments on the technique.

    [1] a) M. Fransen,Am. Lab. 2002, 34, 42-49; b) R. Meier, M. Dirken,Int. Cement &

    Lime Journal.2002, 1, 44-47.

    [2] a) I. Takeuchi, C. J. Long, O. O. Famodu, M. Murakami, J. Hattrick-Simpers, G.

    W. Rubloff, M. Stukowski, K. Rajan. Rev. Sci. Instrum. 2005, 76, 062223; b)

    X'Pert HighScore Plus software from PANalytical, see website visited Nov. 20 th

    2008, http://www.panalytical.com/index.cfm?pid=547; c) D. L. Bish, S. A.

    Howard,J Appl Crystallogr. 1988, 21, 86-91; d) D. L. Bish, S. J. Chipera, in

    Advances in X-ray Analysis (Ed.: C. Barrett, et al.), Plenum Press, New York

    1988, 31, pp. 295-308; e) D. L. Bish, S. J. Chipera, in Advances in X-ray

    Analysis (Ed.: P. Predecki, et al.), Plenum Press, New York 1995, 38, pp. 47-

    57; f) S. J. Chipera, D. L. Bish,Powder Diffr. 1995, 10, 47-55; g) F. H. Chung,J.

    Appl. Crystallogr. 1974, 7, 519-525; h) RockJock Uses Microsoft Excel

    Macros and the Solver function to perform a whole-pattern modified Rietveld-

    type refinement to perform quantitative analysis. Written by Dennis D. Eberl,

    the software was published in 2003 as U.S.G.S. Open-File Report 03-78,

    Determining Quantitative Mineralogy from Powder X-ray Diffration Data.

    Available via FTP from ftp://brrcrftp.cr.usgs.gov/pub/ddeberl/RockJock. i) See

    JADE from http://www.rigaku.com/index_world.html.

    [3] a) C. J. Gilmore, G. Barr, J. Paisley,J. Appl. Crystallogr.2004, 37, 231242; b)

    G. Barr, W. Dong, C. J. Gilmore,J. Appl. Crystallogr.2004, 37, 243252.

    [4] G. Barr, W. Dong, C. J. Gilmore,J. Appl. Crystallogr.2004, 37, 658664.

    [5] R. Storey, R. Docherty, P. D. Higginson, C. Dallman, C. J. Gilmore, G. Barr, W.

    Dong, Cryst. Rev. 2004, 10, 45-56.

    [6] C. J. Long, J. Hattrick-Simpers, M. Murakami, R. C. Srivastava, I. Takeuchi, V.

    L. Karen, X. Li.Rev. Sci. Instrum. 2007, 78, 072217.

    [7] L. A. Baumes, M. Moliner, A. Corma. ChemEngComm. 2008, 10, 1321-1324

    [8] A. Corma,J. Catal. 2003, 216, 298-312.

    [9] H. Lee, S. I. Zones, M. E. Davis,Nature.2003, 425, 385-387.

    [10] a) A. Corma, M. J. Daz-Cabaas, J. Martnez-Triguero, F. Rey, J. Rius,Nature.

    2002418, 514-517. b) O. B. Vistad, D. E. Akporiaye, K. Mejland, R. Wendelbo,

    A. Karlsson, M. Plassen, K. P. Lillerud, Stud. Surf. Sci. Catal.2004, 154, 731-

    738. c) A. Cantin, A. Corma, M. J. Diaz-Cabanas, J. L. Jord, M. Moliner,J. Am.

    Chem. Soc. 2006, 128, 4216-4217. d) A. Corma, M.J. Diaz- Cabaas, M.

    Moliner, C. Martinez.J. Catal. 241, 2, 312-318, 2006.e) A.Corma, M. Moliner,

    J. M. Serra, P. Serna, M. J. Daz-Cabaas, L. A. Baumes, Chem. Mater. 2006,

    18, 3287-3296.

    [11] a) J. M. Serra, A. Corma, D. Farrusseng, L. A. Baumes, C. Mirodatos, C. Flego,

    C. Perego, Catal. Today, 2003, 81, 425-436; b) C. Klanner, D. Farrusseng, L. A.

    Baumes, C. Mirodatos, F. Schth, QSAR & Comb. Sci., 2003, 22, 729-736; c) D.

    Farrusseng, C. Klanner, L. A. Baumes, M. Lengliz, C. Mirodatos, F. Schth,

    QSAR & Comb. Sci.2005, 24, 78-93; d) F. Schth, L. A. Baumes, F. Clerc, D.

    Demuth, D. Farrusseng, J. Llamas-Galilea, C. Klanner, J. Klein, A. Martinez-

    Joaristi, J. Procelewska, M. Saupe, S. Schunk, M. Schwickardi, W. Strehlau, T.

    Zech, Catal. Today. 2006, 117, 284-290.

    [12] a) A. Corma, M. J. Daz-Cabaas, J. L. Jord, C. Martnez, M. Moliner,Nature.

    2006, 443, 842845; b) M. Moliner, M. J. Daz-Cabaas, V. Forns, C. Martnez,

    A. Corma,J. Catal. 2008, 254, 101-109.

    [13] a) L. A. Baumes, D. Farruseng, M. Lengliz, C. Mirodatos, QSAR & Comb. Sci.

    2004, 29, 767-778; b) C. Klanner, D. Farrusseng, L. A. Baumes, M. Lengliz, C.

    Mirodatos, F. Schth, Angew. Chem. Int. Ed. 2004, 43, 5347-5349; c) L. A.

    Baumes, J. M. Serra, P. Serna, A. Corma, J. Comb. Chem.2006, 8, 583-596; d)

    L. A. Baumes,J. Comb. Chem. 2006, 8, 304-314; e) J. M. Serra, L. A. Baumes,

    M. Moliner, P. Serna, A. Corma, Comb. Chem. High Throughput Screen.2007,

    10, 13-24; f) L. A. Baumes, M. Moliner, A. Corma, QSAR & Comb. Sci 2007,

    26, 255-272; g) P. Serna, L. A. Baumes, M. Moliner, A. Corma.J. Catal.2008,

    258, 25-34; h) L. A. Baumes, P. Collet, Com. Mat. Sci. 2008,

    doi:10.1016/j.commatsci.2008.03.051.

    [14] G. Barr, W. Dong, C. J. Gilmore,J. Appl. Cryst.2004, 37, 874-882.

    [15] a) D. J. Berndt, J. Clifford, in Advances in Knowledge Discovery and Data

    Mining(Ed.: U. M. Fayyad, et al.) , AAAI Press/MIT Press, Menlo Park, CA

    1996, pp. 229-248; b) E. Keogh, Data Mining and Machine Learning in Time

    Series Databases, at the 18th ACM SIGKDD Int. Conf. on Knowledge

    Discovery and Data Mining (KDD04). Seattle, WA, USA, 2004.

    [16] a) W. Lin, M. A. Orgun, G. J. Williams, Australasian Data Mining Workshop,

    Macquarie Univ. and CSIRO Data Mining, 2002; b) C. M. Antunes, A. L.

    Oliveira, Workshop on Temporal Data Mining, at the 7th Int. Conf. on

    Knowledge Discovery and Data Mining, San Francisco, CA, 2001.

    [17] G. Das, K. Lin, H. Mannila, G. Renganathan, P. Smyth, Global partial orders

    from sequential data, at the 4th Int. Conf. on Knowledge Discovery and Data

    Mining, New York, NY, 1998.

    [18] a) E. Keogh, C. A. Ratanamahatana,J. Knowl. Inf. Syst., 2005,7, 358386; b) B.

    K. Yi, H. V. Jagadish, C. Faloutsos, Proceedings 14th IEEE Int. Conf. on Data

    Engineering.1998, pp: 201 208.

    [19] N. Lesh, M. J. Zaki, M. Ogihara,IEEE Intell. Syst. App. 2000, 15(2), 48-56.

    [20] a) J. Buhler, M. Tompa, J. Comput. Biol. 2002, 9, 225-242; b) B. Chiu, E.

    Keogh, S. Lonardi,Probabilistic Discovery of Time Series Motifs, at the 9th Int.

    Conf. on Knowledge Discovery and Data Mining, Washington, DC, 2003.

    [21] a) J. C. Chappelier, M. Gori, A. Grumbach, in Sequence Learning: Paradigms,

    Algorithms and Applications (Ed.: R.Sun, G. L. Giles), Springer-Verlag, London,

    2001, pp. 105-134; b) D. L. James, R. Miikkulainen, Advances in Neural

    Processing Systems. 1995, 7, 577-584; c) P. Somervuo, T. Kohonen, Neural

    Process. Lett.1999, 10, 151-159.

    [22] a) F. R. Lin, L. S. Hsieh, S. M. Pan, Learning clinical pathway patterns by

    hidden markov model, at the 38th Annual Hawaii Int. Conf. on System Sciences

    (HICSS'05), Big Island, Hawaii, USA, 2005. b) J. Lin, M. Vlachos, E. Keogh, D.

    Gunopulos,Iterative incremental clustering of time series, at the IX Conf. onExtending Database Technology (EDBT 2004), Crete, Greece, 2004.

  • 8/14/2019 Chem a Eur J ATW Full Paper

    10/11

    10

    [23] E. Keogh, M. Pazzani, Scaling up Dynamic Time Warping for Datamining

    Applications, at the 6th Int. Conf. on Knowledge Discovery and Data Mining,

    Boston, MA, 2000.

    [24] E. Keogh, M. Pazzani, Derivative Dynamic Time Warping, at the 1st SIAM

    International Conference on Data Mining, Chicago, USA, 2001.

    [25] H. Sakoe, S. Chiba,IEEE T. Acoust. Speech. 1978, 26, 43-49.

    [26] a) D. E. Goldberg, The Design of Innovation: Lessons from and for Competent

    Genetic Algorithms, Kluwer Academic Publishers, Norwell, MA. 2002; b) L. M.

    Schmitt, Theory of Genetic Algorithms, Theoretical Computer Science, 2001,

    259, 1-61; c) M. D. Vose, The Simple Genetic Algorithm: Foundations and

    Theory, MIT Press, Cambridge, MA. 1999. d) D. Whitley.A genetic algorithm

    tutorial. Statistics and Computing. 1994, 4, 6585.

    [27] E. Keogh, X. Xi, L. Wei, C.A. Ratanamahatana. 2006, The UCR Time Series

    Benchmark: www.cs.ucr.edu/~eamonn/time_series_data/

    Received: ((will be filled in by the editorial staff))Revised: ((will be filled in by the editorial staff))

    Published online: ((will be filled in by the editorial staff))

  • 8/14/2019 Chem a Eur J ATW Full Paper

    11/11

    11

    Making clearer the way towards

    new structuresLaurent A. Baumes

    *, Manuel

    Moliner and Avelino Corma*

    Design of a Full-Profile

    Matching Solution for High-Throughput Analysis of Multi-

    Phases Samples Through

    Powder X-Ray Diffraction

    Design of ATW

    methodology for

    automatically, quickly and

    reliably deciphering X-raydiffraction patterns