
CLUSTER ANALYSIS

Mike Baxter

A version of this paper is published as:

Baxter, M.J. (2008) Cluster analysis. In Liritzis I. (ed.), New Technologies in the Archaeognostic Sciences, Gutenberg Press, Athens, Greece, 445-481. (Paper and book are published in Greek)


Introduction

Cluster analysis (CA) is a generic term that refers to a range of methods aimed at identifying groups in a set of data. In archaeology, to give only a few examples, CA has been used to group artefacts on the basis of their chemical compositions; assemblages on the basis of the similarity of their profiles; and to identify spatial clustering on the basis of the location of artefacts in space. This paper concentrates on the first of these applications, though most of the ideas to be discussed have general application.

The intention is to discuss the ideas that underpin different methods of CA in as non-mathematical a way as possible. Some notation and ideas are unavoidable, and are mostly discussed in the next section; more complex material is provided in the appendix. The section on model-based clustering is more demanding than other sections. There are many good books and articles on CA, and statistical software to implement the methods; a selective review is provided towards the end of the paper.

The heart of the paper discusses the main types of CA that have either been widely used in archaeometry, or could be used. Readers familiar with the subject will recognise that I have been highly selective, but I have tried to comment on what seem to be the most popular methods, as well as newer approaches that may be worth exploring.

Occasional reference is made to principal component analysis (PCA), and the reader will need a working knowledge of PCA to follow some of the discussion. As used here PCA allows high-dimensional data to be viewed through a series of two dimensional plots. Any text on multivariate analysis, some of which are referenced later, should provide an account.

Notation

For definiteness, assume we have n artefacts whose chemical composition has been measured with respect to p variables. The results may be collected in an n × p table of data, or data matrix, X, with typical element xij. The rows correspond to artefacts, and the term case will be used to refer to a row. Sometimes xi will be used to refer to the 1 × p vector of values for case i.

Let x̄j be the mean of variable j, and sj its estimated standard deviation. Usually the raw data matrix is modified in some way before CA. If yij = xij – x̄j is used, the variable is said to be centred. If yij = (xij – x̄j)/sj is used the variable is standardized (the terms auto-scaled and normalized are also used, though the latter is best reserved for transformations that attempt to induce a normal distribution).
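As a minimal illustration in R (the software discussed towards the end of the paper), both treatments are available through the built-in scale function; X here stands for any numeric data matrix with cases in rows:

    X <- matrix(rnorm(150, mean = 10), ncol = 3)          # toy 50 x 3 data matrix
    Y.centred <- scale(X, center = TRUE, scale = FALSE)   # yij = xij - mean of column j
    Y.standardized <- scale(X)                            # yij = (xij - mean)/sd of column j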

The modified data matrix arising from either of these treatments is Y, with typical element yij. In some approaches to CA the raw data are transformed to logarithms (to base 10) before centring or standardization. How the data should be treated prior to CA is a non-trivial issue that is discussed later.

Many methods of cluster analysis result in the identification of G groups, with the hope that cases in a group are similar to each other and dissimilar from cases in other groups. This introduces the idea of (dis-)similarity, which is critical to an understanding of how many methods of CA work.

Many measures of (dis-)similarity can be defined, contributing to the many methods of CA available. A common measure of dissimilarity is Euclidean distance, or its square, the latter defined as

dij² = (yi – yj)(yi – yj)T (1)

where T indicates a vector transpose. Euclidean distance, dij, is just the generalization to p dimensions of distance as customarily measured in the real world.

The definition in terms of vectors is not really necessary here, but is unavoidable in presenting Mahalanobis distance (MD), important in some archaeometric applications of CA. For a single group, with estimated covariance matrix S, the MD between cases i and j is defined as

(yi – yj)S⁻¹(yi – yj)T. (2)

The estimated covariance matrix, S, assumes some importance in later discussions. It is a square p × p matrix. The entry in row i and column j is sij, the estimated covariance between variables i and j. The ith diagonal element may be written as sii or si² and is the estimated variance of variable i. The matrix is symmetric, so sij = sji. The notation Sg will sometimes be used to emphasise that the estimate is for a particular group g, with corresponding population covariance matrix, Σg.

The covariance measures the strength of relationship between two variables, but the actual values of the sij depend on the units of measurement, and can be difficult to interpret. For this reason it is often useful to scale it using rij = sij/sisj to get correlations, for which -1 ≤ rij ≤ 1. The p × p correlation matrix, R, has typical element rij, with the ith diagonal element (the correlation of a variable with itself) equal to 1.

If yj in the equation for MD is replaced by the vector of variable means (on the scale defined by the yij) then we obtain the distance between a case, yi, which may or may not be a member of the group, and the group centroid. Readers uncomfortable with the mathematics here should know that the merit of MD, compared to Euclidean distance, is that it makes allowance for the fact that variables may be correlated (common in archaeometric data) in a manner that may be beneficial for the clustering process. This is discussed in more detail later.
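For readers who wish to experiment, both distances are easily computed in R; a minimal sketch, assuming a (standardized) data matrix Y, and noting that the built-in mahalanobis function returns squared distances from a specified centre:

    d <- dist(Y)                            # Euclidean distances between all pairs of cases
    S <- cov(Y)                             # estimated covariance matrix
    md2 <- mahalanobis(Y, colMeans(Y), S)   # squared MD of each case from the centroid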


For some methods of CA it is necessary to have a measure of how good a clustering into G clusters is. Start with a single cluster; ideally we would like this to be as ‘compact’ as possible, with individual cases close to the centroid (or mean) of the cluster. For a single case, i, in a single cluster g with centroid ȳg, closeness to the centroid can be measured by

(yi – ȳg)(yi – ȳg)T. (3)

If there are ng cases in cluster g, an overall measure of compactness is

Sg = Σi (yi – ȳg)(yi – ȳg)T (4)

where the summation is over the ng cases in the cluster. A measure of how good the clustering is can then be defined by summing Sg over the G clusters, to get SG as an overall measure. This is discussed further in the section on Ward’s method.
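For concreteness, SG is just the total within-cluster sum of squares. Given a data matrix Y and a vector groups of cluster labels, a minimal R sketch is:

    # within-cluster sum of squares, summed over the clusters
    SG <- sum(sapply(split(as.data.frame(Y), groups),
                     function(Yg) sum(scale(Yg, scale = FALSE)^2)))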

Some Approaches to Cluster Analysis

Hierarchical Clustering

Hierarchical agglomerative methods of CA are those most commonly used in archaeometry. Each case is initially treated as a single cluster so there are n in all. The two most similar cases are merged to form a cluster of two cases, giving (n – 1) clusters. Thereafter, clusters are successively merged with each other (treating cases as clusters) on the basis of which are most similar. Eventually all cases are merged into a single cluster.

It is also possible to start by assuming that all cases belong to a single cluster, and then to split clusters successively, one case at a time, until all cases are distinct. This method, hierarchical divisive clustering, is not much used in archaeometry, and is not considered further.

To merge clusters, a measure of how similar clusters are is needed. Similarity can be defined in different ways, contributing to the vast number of ways in which a cluster analysis can be carried out. In single-linkage analysis the similarity of two clusters is measured by the smallest distance between two cases, one from each cluster. The two clusters merged are those for which this smallest distance is smallest. In complete-linkage analysis, similarity is defined by the largest distance between two cases, one from each cluster, the clusters merged being those for which this largest distance is smallest.

Single-linkage CA is rarely used in archaeometry because it tends to produce uninterpretable results unless the structure is obvious. It is sometimes useful for detecting outliers. A criticism of both methods is that the measure of similarity between clusters depends only on two cases, and fails to take account of group structure and other cases.

Average-linkage CA attempts to overcome this problem by defining similarity between clusters as the average distance between all possible pairs of cases, one from each cluster. It has probably been the most widely used method of CA in archaeometry.

The results from a hierarchical CA need to be interpreted. This is almost invariably done using a dendrogram or tree diagram, an example of which is shown in Figure 1.


Figure 1: A dendrogram arising from an average-link cluster analysis of standardized data, for a 27 × 11 data matrix of medieval glass compositions. The data used are those given in Baxter (1989) and are a subset taken from Cox and Gillies (1986).

This is from an average-linkage analysis of a 27 × 11 matrix of standardized data of medieval glass compositions. Euclidean distance was used as the measure of dissimilarity. This can be thought of as a ‘tree’ with ‘branches’ and ‘leaves’ corresponding to the numbered cases on the horizontal axis.

The vertical axis shows the dissimilarity level at which cases or clusters merge. Cases that merge at a low level (e.g., 4 and 9) show a high level of chemical similarity.
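A dendrogram of the kind shown in Figure 1 takes only a few lines in R; a sketch, in which glass is a hypothetical name for the 27 × 11 data matrix:

    Y <- scale(glass)                           # standardize the data
    hc <- hclust(dist(Y), method = "average")   # average linkage, Euclidean distance
    plot(hc)                                    # draw the dendrogram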


Interpretation of dendrograms is usually subjective. Often what is done is to ‘cut’ the tree at a level of (dis-)similarity that isolates the main branches, the leaves associated with the branches defining the clusters. This is not always straightforward, even in a comparatively clear-cut case such as this.

Cutting the tree at a value of 5 results in three clusters; cutting at 4 results in three clusters and an outlier; cutting at just above 2 would result in four clusters and two outliers, with the main cluster to the right being split in two. There seem to be three fairly clear clusters and a possible outlier (20), but this should be checked if possible. The appearance of a dendrogram depends on the choice of style (see Figure 3), choice of method, and the distance measure used. Squared Euclidean, as opposed to Euclidean, distance will often result in dendrograms showing apparently clearer clustering. This interpretive strategy is commonly used. It is sometimes better, and legitimate, to cut different branches at different levels.

Methods of cluster analysis are designed to identify clusters and may do so, even when there are no groups in the data. Some form of cluster validation is, therefore desirable, and is discussed later. One simple and often effective way of confirming the interpretation of a dendrogram is to examine principal component (PC) plots on which the points are labelled. This is done in Figure 2, where case numbering is as in Figure 1. Inspection shows the three groups and outlier suggested by the CA are consistent with the PC plots. An alternative would be to label points according to the cluster they are in.
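Continuing the sketch above, cutting the tree and labelling a PC plot by the resulting clusters might look as follows (glass again being a hypothetical object name):

    groups <- cutree(hc, h = 5)           # cut the tree at height 5; k = 3 gives three clusters directly
    pca <- prcomp(glass, scale. = TRUE)   # PCA of the standardized data
    plot(pca$x[, 1:2], type = "n", xlab = "component 1", ylab = "component 2")
    text(pca$x[, 1:2], labels = groups)   # label cases by cluster membership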


Figure 2: A plot based on the first two principal components (PCs) from a principal component analysis (PCA) of the data used to obtain Figure 1. The intention is to show the fairly clear clustering, but with possible outliers.


Ward’s method

Ward’s method has been widely used in archaeology, and was particularly popular in the 1970s and 80s. In archaeometric investigations it is usually used as an exploratory hierarchical clustering technique, in the same spirit as the linkage methods. It possesses distinctive characteristics, however, that provide a useful lead into other kinds of CA, so it is discussed separately here.

Ward’s method is initiated in the same way as the linkage methods to give (n – 1) clusters, often using squared Euclidean distance as a dissimilarity measure. The quality of the clustering can be measured using the term SG, defined in the section on notation, where G = (n – 1). In the further merging of clusters, remembering that individual cases are clusters, the value of G is reduced by 1, while SG is increased. It is easy to show that any merge will increase SG and the merge is chosen for which this increase is least.

Users should be aware of several aspects of Ward’s method. The linkage methods described are essentially grouping algorithms with no firm basis in statistical theory. That they are widely used is presumably because they have seemed sensible to the people who devised them, and have found favour with practitioners. Ward’s method, by contrast, attempts to optimise an explicit objective function, SG.

All the agglomerative methods discussed suffer from the drawback that once a merge is made it cannot be undone. Ward’s method is no exception, and the word ‘attempts’ was used above because often SG will not be optimised. That is, given any specific partition into G clusters produced by Ward’s method, it may be possible to improve SG by reallocating cases between clusters. This is the basis of k-means methodology, discussed in the next section.

Users new to cluster analysis sometimes find that they like Ward’s method, compared to other alternatives, because it produces apparently clear and well separated clusters more readily. This can be a delusion – the method can suggest clusters quite clearly, even when none exist. This behaviour can be understood by viewing Ward’s method as a special case of a model-based method. Such methods are discussed later.

To illustrate the problem, Figure 3 shows the dendrogram for a Ward’s method analysis of standardized data. It is tempting to conclude that there are two clear clusters in the data (note the different style of presentation from that used in Figure 1). The data used were 50 cases randomly generated from a two-dimensional multivariate normal distribution. The observed correlation between the variables is 0.79. The data are plotted in Figure 4, with cases labelled according to which of the clusters they belong to. It is quite clear that the distinctive separation suggested by Figure 3 is misleading.
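The phenomenon is easy to reproduce. The following sketch generates structureless but correlated bivariate normal data, in the spirit of that used for Figures 3 and 4, and applies Ward’s method (the correlation of 0.8 is an illustrative choice, close to the 0.79 observed):

    library(MASS)                                      # for mvrnorm
    set.seed(1)
    Sigma <- matrix(c(1, 0.8, 0.8, 1), 2, 2)           # correlated, single population
    X <- mvrnorm(50, mu = c(10, 10), Sigma = Sigma)    # 50 cases, no real clusters
    hw <- hclust(dist(scale(X)), method = "ward.D2")   # "ward" in older versions of R
    plot(hw)                                           # nevertheless suggests distinct clusters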


Figure 3: The dendrogram arising from a Ward’s method analysis of standardized data generated randomly from a bivariate normal distribution. This suggests, clearly but incorrectly, that there are apparently two distinct clusters.

This is not to say that Ward’s method will always produce poor results. It will tend to impose a certain kind of structure on the data, which can be understood in terms of the assumptions it implicitly embodies (see later for details). When these assumptions are satisfied it can work well. Empirical experience suggests that other methods that are not model-based can also impose inappropriate structure on data, but because of their lack of a theoretical grounding the reasons are less well understood than for Ward’s method.

K-means and related methods

As already noted, the idea behind Ward’s method is to try and minimise a particular objective function, SG, for any given level of clustering, G. In general, the method will not achieve the optimum, because merges that occur during the early stages of clustering cannot be undone.


Figure 4: The data used to generate Figure 3, labelled by the apparent clustering suggested by that figure.

One way round this problem is, given a choice of G, to reallocate cases between clusters in order to reduce SG. This reallocation proceeds in an iterative manner until no further reduction is possible. This is a particular example of k-means clustering.

The method has been used in archaeometric applications, but less so than one might expect, given that, in the context of Ward’s method at least, it can only improve a clustering. One possible reason for its relative lack of use, compared to hierarchical methods, is that a simple representation of the outcome in the form of a dendrogram is not possible.

Choice of the appropriate number of clusters is not straightforward. To apply the method for fixed G, a starting partition is needed to initiate the iterative reallocation procedure, and the final partition may be a local rather than global minimum. That is, results may depend on the starting position, so the choice of a good starting partition is helpful. Using clusters derived from an application of Ward’s method is one possibility.
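A sketch of this strategy in R, feeding Ward cluster means to kmeans as starting centres (Y a standardized data matrix, G the chosen number of clusters, both assumptions):

    G <- 3
    ward <- cutree(hclust(dist(Y), method = "ward.D2"), k = G)
    starts <- apply(Y, 2, function(v) tapply(v, ward, mean))   # G x p matrix of cluster means
    km <- kmeans(Y, centers = starts)                          # iterative reallocation from those centres
    table(ward, km$cluster)                                    # compare initial and refined partitions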

As described above, k-means is based on squared Euclidean distance as a measure of dissimilarity. The idea can be extended to the use of general measures of dissimilarity. In k-medoids clustering, for example, a cluster centre is defined by a ‘typical’ value called the medoid. Cases are allocated to the cluster with the most similar medoid, this process proceeding iteratively. This was developed as a more robust alternative to k-means.


Figure 5 shows some output from a k-medoids analysis, using the partitioning around medoids (PAM) method of Kaufman and Rousseeuw (1990), with Euclidean distance as the dissimilarity measure. The data are the same as those used for Figure 1, and have been standardized.

This is an example of a silhouette plot, shown here for the 3 cluster solution. If a(i) is the average dissimilarity of case i from other cases in its cluster, and b(i) is the average dissimilarity of i from cases in the closest other cluster, then the silhouette width for case i is the difference b(i) – a(i), scaled by the larger of a(i) and b(i), so that its maximum possible value is 1. Values near 1 are very well clustered; values near zero probably lie between two clusters; and negative values (of which there are none here) are probably in the wrong cluster. The silhouette width is given on the horizontal axis of the figure. Cases in clusters 1 and 2 are generally well-clustered, apart from case 22 in cluster 2. Cluster 3 is less well-defined (the silhouette widths are generally smaller) and case 20 is not well clustered.
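Output of the kind shown in Figure 5 is produced directly by the cluster package in R; a minimal sketch for a three-cluster solution, Y again a standardized data matrix:

    library(cluster)
    fit <- pam(Y, k = 3)      # k-medoids, Euclidean distance by default
    plot(silhouette(fit))     # silhouette plot, as in Figure 5
    fit$silinfo$avg.width     # overall average silhouette width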

Given G, the methods discussed so far lead to each case being assigned to just one cluster. This is sometimes called a crisp clustering. In fuzzy clustering the membership of a case may be spread over several clusters. Slightly more formally, uig is the membership of case i in group g, with 0 ≤ uig ≤ 1 and Σg uig = 1. Estimation of the uig can be achieved by a fuzzy k-means algorithm (also called fuzzy c-means), of which there are several. Fuzzy clustering has been little used in archaeology (the appendix gives more technical detail).
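The memberships uig can be estimated with the fanny function in the cluster package; a minimal sketch, where memb.exp is the fuzziness exponent (the m = 2 used in the example later in the paper):

    library(cluster)
    ff <- fanny(Y, k = 3, memb.exp = 2)   # fuzzy clustering
    round(ff$membership, 2)               # the uig; each row sums to 1
    ff$clustering                         # the nearest crisp clustering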

Figure 5: An example of a silhouette plot, used in conjunction with the partitioning around medoids (PAM) method of Kaufman and Rousseeuw (1990).

[The plot shows silhouette widths on a 0.0–1.0 horizontal axis for the n = 27 cases in 3 clusters; the average silhouette width is 0.65, with cluster averages of 0.82 (cluster 1, n = 6), 0.73 (cluster 2, n = 10) and 0.49 (cluster 3, n = 11).]


Model-based methods

A known problem with Ward’s method is that it will tend to produce spherical clusters of roughly equal size. The same phenomenon is sometimes observed with other clustering algorithms. This can be a problem with certain kinds of material studied in archaeometry, where the variables can be expected to be correlated (see Harbottle (1976) for a discussion of this), so that any clusters in the data can be expected to be (hyper-)ellipsoidal in shape, with no prior expectation that they are of the same size. Figures 3 and 4 illustrated the problem for a very simple data set showing moderate correlation.

In principle, one way to avoid this problem is to use model-based methods. In such studies, for G clusters, assume that the gth cluster is sampled from a multivariate normal distribution (MVN) with mean μg and covariance matrix Σg, which can be written as MVN(μg, Σg). A mixture model is obtained by assuming that the observed sample is drawn from a mixture of these multivariate normal distributions.

One approach is to estimate the parameters of the distributions, including the mixing proportions, πg, by maximum likelihood. Given the estimates, the relative probabilities of cases belonging to the gth component can be determined, and cases can be assigned to the cluster for which they have the highest probability, if a crisp clustering is needed.

In classification maximum likelihood, cases are associated with labels, initially unknown, that identify the cluster to which they belong. These labels are estimated and provide a direct clustering of the data.

Some details are given in the appendix. The methodologies can be implemented in open-source software discussed later.
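In R the mixture approach is implemented in the mclust package (distributed as mclust02 at the time of writing); a sketch of a typical call, as in recent versions of the package:

    library(mclust)
    mb <- Mclust(Y, G = 1:5)   # MVN mixtures for 1 to 5 clusters, model chosen by BIC
    summary(mb)                # chosen covariance structure and number of clusters
    mb$classification          # crisp clustering by highest estimated probability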

Bayesian methods of CA add extra structure to the mixture model in the form of prior distributions for the unknown parameters in the model. This added complexity places the methodology beyond the reach of the average non-statistical researcher, unless they can find a suitable statistical collaborator. Bayesian CA has not been widely applied in archaeometry; references to its use are provided later.

The methods described so far have been developed in the statistical literature. The framework sketched above provides a useful basis for describing methods developed by archaeometricians, to deal with what they see as the peculiarities of archaeometric data. The main idea, now described, is quite simple and can be implemented in more than one way.

Determine a provisional grouping into G clusters, by any method that seems suitable. Measure the Mahalanobis distance (MD) of every case to each cluster centroid and, if necessary, re-allocate the case to the group with the nearest centroid. Repeat this process until a stable clustering is obtained. This is similar to the idea used in k-means clustering as described earlier, but the use of MD means that the potential ellipsoidal nature of clusters is accounted for.

One refinement, when calculating the MD of a case i to its own group centroid, is to use ‘leave-one-out’ methods, where the centroid is calculated omitting i. Another refinement is to assume that, within clusters, data have an MVN distribution. This allows the MDs to be converted to probabilities, so that the strength of cluster membership can be assessed. Where the probabilities of cluster membership are sufficiently low for all clusters a case can be declared an outlier, and possibly omitted from subsequent iterations.
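To fix ideas, the following is a minimal sketch of the basic reallocation loop only; it is not the implementation used by any of the laboratories cited below, and omits their refinements (leave-one-out centroids, conversion of MDs to probabilities, outlier screening). It assumes every group retains more cases than variables, so that the covariance matrices are invertible:

    md.reallocate <- function(Y, groups, max.iter = 50) {
      for (iter in 1:max.iter) {
        # squared MD of every case from each group centroid (an n x G matrix)
        D <- sapply(sort(unique(groups)), function(g) {
          Yg <- Y[groups == g, , drop = FALSE]
          mahalanobis(Y, colMeans(Yg), cov(Yg))
        })
        new.groups <- apply(D, 1, which.min)   # nearest centroid in MD terms
        if (all(new.groups == groups)) break   # stable clustering reached
        groups <- new.groups
      }
      groups
    }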

These more specifically archaeometric approaches have mainly been developed in a small number of laboratories (e.g., Bieber et al. 1976, Glascock 1992, Beier and Mommsen 1994, Neff 2002), and many published applications – to be found in the pages of Archaeometry and the Journal of Archaeological Science among others – involve these authors and their co-workers. Implementation of the basic idea differs, and the reader is referred to the papers cited for fuller details. Baxter (2001a) and Baxter (2003: 97-99) summarise some of the differences in approach.

These methods can be viewed as model-based, to the extent that they depend on the MVN assumption to exploit their full power. My own (unpublished) experience of using these methods is that their application is as much ‘art’ as ‘science’, since a lot of decisions need to be made (e.g., numbers of clusters; starting partitions; outlier identification) that may require judgements that could differ from researcher to researcher.

A lot of practical issues have been avoided in the account given above. Issues concerning data transformation (to achieve normality), treatment of outliers and the determination of the numbers of clusters are discussed, in general terms, in the next section.

The main limitation, particularly when a large number of variables (say 20-30 in this context) are involved, is the sample size requirement. For those methods explicitly based on MD, a minimum requirement is that group sizes, ng, be greater than p. Unless ng is somewhat greater than this, results can be unstable – because estimates of the covariance matrices, Sg, are unstable. Guidelines vary but, for example, ng/p > 3 has been suggested as a minimum.

A little thought will suggest that for many data sets typical of archaeometric analysis the full power of the methodology is not available. For n = 100, for example, with p = 20 and G = 5, the average group size is n/G = 20, which only just equals p, so at least some group sizes will be too small to allow MD to be used.

Similar, and related, difficulties apply to the other model-based methods discussed. For data sets with typical p, constraints have to be imposed on the form of the covariance matrices, as there are otherwise more parameters to estimate than cases. Assuming equal covariance matrices is one common strategy, and this implies that clusters should be ellipsoidal, with the same orientation and similar size. If it is further assumed that the covariance matrices are diagonal, so that Σg = σ²I, where I is the p × p identity matrix, this amounts to an assumption that clusters are spherical and of equal size. It can be shown that this is essentially equivalent to Ward’s method, which can thus be viewed as a model-based method that is pre-disposed to finding such clusters.

Some of the methods discussed have associated with them methods for testing hypotheses about the number of clusters in the data. They can also, in principle, cope with the ellipsoidal nature of clusters that typifies some kinds of archaeological data. Apart from their relative mathematical complexity, the main barriers to their wider use are practical. They are worth further investigation, but a willingness to engage with the mathematics, or a suitably qualified collaborator, is desirable.

Issues in Cluster Analysis

In this section a number of related issues that users of CA need to bear in mind, when carrying out and reporting an analysis, are discussed.

Data transformation

Prior to a CA some form of data treatment is needed. The most commonly used treatments in archaeometry are standardization, or transformation to base 10 logarithms without subsequent standardization. Standardization is undertaken to give variables equal weight, so that those with large variances do not, predictably, dominate an analysis. Logarithmic transformation will produce new variables with a similar order of magnitude, and some researchers assert that the transformed variables, particularly if they are trace elements, are more likely to have a normal (Gaussian) distribution within clusters. This can be of importance in model-based analyses where the normality assumption is used in the analysis.

Standardization of log-transformed data is sometimes used, but will often give similar results to using standardization without transformation. This is because there is a monotonic relationship between the raw and logged data, and standardization will tend to convert values on either scale to a similar range of values. The exception to this generalization is when transformation either down-weights the effect of cases that are outliers on the original scale; or creates outliers not present on the original scale from cases with values close to zero.

Archaeometric data of the kind discussed here are an example of compositional data. For fully compositional data, xij ≥ 0, and for fixed i, the xij sum to 100%, assuming all measurements are in %. A sub-set of such data is sub-compositional. For j = 1, …, (p-1) it is possible to define ratios of the form xij/xip and base analyses on these or their logarithms. This has been done, and debated, since the 1960s/70s (Wilson 1978) and, for statistical theoretical reasons it has been argued that this is the only correct approach (Aitchison et al. 2002).
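In R the transformation is a one-liner; a sketch, assuming a compositional matrix X with strictly positive entries and taking the pth variable as the (analyst-chosen) divisor:

    p <- ncol(X)
    logratios <- log10(X[, -p] / X[, p])   # log(xij/xip) for j = 1, ..., p-1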

For analyses based on trace elements alone log-ratio analysis (LRA), as it is called, is equivalent to the use of log-transformed data. For analyses that include major and minor elements/oxides the theoretical merits of LRA can be outweighed by poor performance, in terms of the interpretability of the results obtained, compared to the use of standardized data. Baxter and Freestone (2006) discuss the issues involved, and provide references to archaeometric applications of LRA.

Choice of method

Methods of hierarchical agglomerative CA are the workhorse of archaeometric applications. As a minimum, the method used, including the measure of dissimilarity chosen and decisions about data standardization or transformation, should be reported in applications. Comment on why a particular approach was preferred is desirable, but is often omitted. There is, in fact, little theoretical reason for choosing between the more popular methods of CA, so that pragmatic considerations are acceptable, but they should be stated.

It is unacceptable to try a variety of methods and report only that which gives ‘good’ results. It is easy, given the capabilities of modern software, to apply a variety of methods to a data set so that it is pointless to urge researchers not to do this. What is important is the honesty with which results are reported. If different analyses lead to similar conclusions it is worth stressing this, since it tends to strengthen the conclusions. If different methods lead to different results further investigation is necessary. It is possible that different aspects of the data are being revealed, but also possible that the apparent structure revealed by some methods is illusory.

One approach is to do an initial analysis using Ward’s method, which tends to produce more easily ‘interpretable’ results than other methods. Given an initial clustering determined in this way, output from other methods can be examined to see whether the same structure is apparent, albeit in a possibly more ‘noisy’ fashion.

With the caveats about sample and cluster sizes, output from a hierarchical analysis can be used as the starting point for the iterative reallocation procedures discussed, or for suggesting an appropriate number of clusters to investigate in a model-based method.

Variable selection

Baxter and Jackson (2001) discuss the issue of variable selection and only the most important points are reviewed here. Variable selection is inevitable; the analytical techniques used to generate compositional data invariably measure only a subset of the elements in the periodic table. Before statistical analysis some of these are often discarded, because of poor precision, more-or-less constant values, too many values below the level of detection, and so on.

Once this selection has taken place, implicit and overt, a common view is that as many variables as possible should be used in statistical analysis. This is wrong. Assume there are clusters in a set of data, and use the term ‘structure-carrying’ to refer to those variables that help distinguish between at least two clusters (most simply, if there were two clusters, then a variable with a bimodal distribution associated with the two clusters would be structure-carrying). It is easy to construct artificial examples where the effect of a large number of non-structure-carrying variables overwhelms the influence of the structure-carrying variables, so that clusters may not be detected. There is no reason why this should not be an issue with real data.

Identifying a problem is one thing; dealing with it is another. Statistical research on this topic has not carried over to the archaeometric literature. Mathematically inclined archaeometricians who find the problem interesting could usefully start with Friedman and Meulman (2004). The less ambitious should be aware of the issue and be prepared to use their subject-based knowledge to identify variables which are likely to be structure-carrying, or not.

A fairly simple tactic, if tedious with a large number of variables, is to look at all possible pairwise plots of the variables. This will often provide an indication of whether or not there are clusters in the data, and which variables are most useful for identifying them. If there are obvious and large clusters it is often useful to extract them from the data set and repeat the above procedure on them. This may serve to identify further sub-clusters, associated with variables not identified as structure-carrying in the first pass through the data.
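In R this tactic is a single call, repeated on subsets as clusters are extracted (main.cluster here is a hypothetical index vector for an extracted cluster):

    pairs(X)                   # all pairwise scatterplots of the variables
    pairs(X[main.cluster, ])   # repeat on an extracted cluster to seek sub-clusters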

Outliers

In CA an outlier can be defined, loosely, as a case that is distant from, or cannot be comfortably associated with, any of the clusters in the data. The presence of outliers in a set of data can, in principle, distort the appearance of the dendrogram in hierarchical CA, and invalidate the normality assumption used in model-based methods. It is sensible, therefore, to try and identify the more obvious outliers in a set of data and remove them before further analysis (for separate discussion), if the main aim is to identify groups in a data set. Where less obvious outliers are identified in the course of analysis, it is often sensible to remove these as well, and proceed in an iterative fashion. Fuller discussion of some of the points raised here is given in Baxter (1999).

Given the extensive literature on outlier detection in the statistical literature, surprisingly little is directly relevant to archaeometric problems. This is because much of it is concerned with detecting outliers relative to what is otherwise a single cluster of data. Identifying outliers relative to several clusters, where these are initially unknown, and where their definition may be affected by outliers, has received much less attention. Relatively informal methods of outlier detection are often quite effective.

Univariate and bivariate inspection will usually serve to identify gross outliers. Plots based on the first few principal components (PCs) will also identify the more obvious outliers, and some less obvious. Since clear outliers can distort the appearance of plots based on the PCs it helps to remove them and repeat the process iteratively, to identify more ‘subtle’ outliers. The principle at work here is that cases that are distant from the bulk of the data on the first few PCs will be distant in the full p-dimensional space.
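A sketch of this screening in R; the 95% cutoff used to flag the most distant cases is purely illustrative, and in practice the flagged cases would be inspected rather than removed automatically:

    pca <- prcomp(X, scale. = TRUE)
    scores <- pca$x[, 1:4]                      # first few PCs
    d2 <- rowSums(scale(scores)^2)              # squared distance from the centre of the PC scores
    flagged <- which(d2 > quantile(d2, 0.95))   # candidate outliers
    pairs(scores, col = ifelse(d2 > quantile(d2, 0.95), 2, 1))
    # remove clear outliers, recompute the PCA on the remainder, and repeat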

The process just described can lead to the identification of a relatively large number of outliers. In the study by Papageorgiou et al. (2001) of the compositions of 130 specimens of Late Roman Cooking Ware from the Balearic Islands, 22 cases were identified as outliers and removed prior to the application of various clustering methods.

The detection of outliers is ‘built-in’ to those model-based methods that use Mahalanobis distance. A good example is provided by the study of Olmec pottery production in Blomster et al. (2005), in which 188 out of 944 cases were judged to be outliers, or were not assigned to clusters. Some researchers are uneasy about the removal from the final analysis of a lot of outliers or of any outliers at all, on the grounds that data are being ignored or manipulated to get results congenial to the investigator. If a primary aim of an analysis is to identify the main groups or pattern in a set of data, this concern seems to me misguided. There is no logical reason why all cases should be assignable, with reasonable confidence, to a cluster, and no logical reason for expecting only a very small number of outliers (relative to the main clusters) in a large data set.

Number of clusters and cluster validation

For all the methods discussed, a decision has to be made about the number of clusters, G, to report and interpret. With obvious structure in a data set, different methods are likely to lead to the same conclusions (with the caveat that clusters need to be large enough for some of the model-based methods to be used). Often CA may not even be necessary in these circumstances.

For the hierarchical methods a decision is often made on the basis of the appearance of the dendrogram and we have seen, in Figures 3 and 4, that this can be misleading. It is common to cut the tree at a particular level, but often better to cut at different levels to isolate the more (visually) distinct branches.

Formal approaches to determining the number of clusters in a set of data are discussed, for example, in Everitt et al. (2001: 177-196), but have been little used in archaeometry. Similarly, some of the model-based methods are associated with tests for the number of clusters, but these too have been little used. More informal, graphical, methods are often useful.

For example, suppose a range of possible values of G is suggested by a CA, which may be any of those discussed above. Carrying out a principal component analysis (PCA) and producing component (PC) plots, labelled by cluster membership, for competing values of G will often be informative. If G is too small then groups separated on the PC plots may have the same labels; if G is too large cases within the same group suggested by the PC plots may have different labels.

The more obvious structure will often be apparent on a plot of the first two PCs, but it is worth inspecting all possible pairwise plots for, say, the first four or five PCs. This is because some of the clusters suggested by a CA may not obviously separate out on the first two PCs, but do so using the others.

Another useful tactic is to strip out very obvious outliers and clusters from the data (those that are suggested by the CA and clearly distinct from the rest of the data) and repeat both the CA and inspection using PCs with what remains. The aim here is to see if the structure suggested in the original analysis remains, or whether other structure, obscured in the original analysis, exists.

These informal methods are not foolproof, but often work well in application. Sometimes the iterative application of PCA, independently of CA, is sufficient to reveal the structure in the data, and it can be viewed as an informal approach to CA if used in this way.

Further Reading

Books written specifically for archaeologists, that discuss CA, include, in ascending order of difficulty, Shennan (1997), Baxter (1994) and Baxter (2003). Not all the methods discussed in this paper are covered in these books.

General statistical texts, with a wider coverage, include Everitt et al. (2001), which is devoted to CA, accessible, and includes archaeological examples. Gordon (1999), at a more advanced level, is devoted to the subject of classification.

Good statistical texts on multivariate analysis, with treatments of CA, abound. They include Manly (2004), Everitt and Dunn (2001), Krzanowski (2000), Krzanowski and Marriott (1995), and Seber (1984). This is in rough order of difficulty, Manly’s text being the most introductory.

Articles evaluating or developing aspects of the use of multivariate methods in archaeometry include Bieber et al. (1976), Pollard (1986), Glascock (1992), Beier and Mommsen (1994), Baxter and Buck (2000), Baxter (2001b) and Neff (2002). Several of these outline approaches that have been developed in particular laboratories.

For archaeometric applications of CA, much of it very ‘standard’, with an often cursory discussion of CA, the journals Archaeometry, the Journal of Archaeological Science, and the published proceedings of Archaeometry conferences are good sources. Baxter (1994: 79-81) has a (now dated) review. Papers written or co-authored by the researchers noted in the previous paragraph are often of more than average interest.

For ‘newer’ approaches to cluster analysis, that have mostly had limited application in archaeometry, Everitt et al. (2001) is probably the most accessible statistical text for a non-statistical readership and includes material on mixture models, k-medoids and fuzzy CA. Ripley (1996), Hastie et al. (2001) and Webb (2002) are all good, but at a more advanced level. They are primarily concerned with methods of supervised pattern recognition (e.g., discriminant analysis, classification trees, neural networks), but have chapters on unsupervised pattern recognition that cover many of the newer methods.

Kaufman and Rousseeuw (1990) present a variety of methods for robust CA, including PAM. Some of the text, particularly computational details, is outdated, but the book provides useful background for the implementation in R (see below). See, also, Struyf et al. (1996).

Fuzzy CA has been little used, to date, in archaeometry. The examples given in Everitt et al. (2001) and Baxter (2006) use real archaeometric data, but are essentially illustrative.

Banfield and Raftery (1993) is a useful starting point for a statistical treatment of model based clustering, and Fraley and Raftery (2007) provide updated computational information. Fraley’s website (http://www.stat.washington.edu/fraley/ - accessed 20/02/07) is a useful resource for keeping track of developments. Papageorgiou et al. (2001) is one of the more detailed explorations of the merits of model-based methods as applied to archaeometric data. Hall (2004) and Hall and Minyaev (2002) provide other applications.

Buck et al. (1996) is the best starting point for an exposition of the uses of Bayesian methods in archaeology, with references to, and applications of, CA to archaeometric data. Their pioneering work has not been emulated much, exceptions being Dellaportas and Papageorgiou (2006) and Papageorgiou and Lyritzis (2007).

Software

The general purpose statistical software packages that I am familiar with, of the kind used for teaching statistics to non-specialists, are typically menu driven and include a range of hierarchical and k-means methods among their options. Everitt et al. (2001) review what was current in about 2000, and is inevitably a bit dated. In particular the open-source software, R (see below), is not discussed.

Of commercially available software it is worth singling out CLUSTAN, developed by David Wishart and distributed by Clustan Ltd. (http://www.clustan.com/ - accessed 09/03/07). It is a specialised package for CA, with many more options than the general purpose packages. I have no recent personal experience of using it, but earlier versions were quite widely used in archaeology.

Non-specialist, menu-driven, software can be restrictive, both in the options allowed and the control one has over the presentation of results, which can be unsatisfactory. To obtain more control over presentation, and explore some of the more complex methodologies available, more powerful software is needed, and the open-source package, R, would be the current choice of many statisticians.

R is command-driven (as opposed to menu-driven) with powerful graphical and programming facilities. It is developed and maintained by the R Development Team, and can be obtained from http://cran.r-project.org/ (accessed 09/03/07). It is updated on a regular basis; version 2.4.1 is current at the time of writing (now 2.9.2 when revising this).


For non-statisticians used to menu driven packages, coming to terms with R can be initially difficult, but it is worth the effort. Apart from the comprehensive documentation, Dalgaard (2002) is a good general introduction, while Venables and Ripley (2002) is more advanced and has a section showing how some of the methods discussed here can be applied.

A major attraction of R is that there are a large number of packages, designed by users for specific tasks, which can be installed in addition to what comes with R, and this includes packages for CA. The simpler hierarchical methods are available in the stats package that comes with R; type ?hclust from within R to get information on what is available. Similarly, ?kmeans provides help on k-means clustering.

Among available add-on packages cluster implements the robust methods of Kaufman and Rousseeuw (1990), including k-medoids clustering using the pam function, and a robust version of fuzzy CA using the fanny function. Fuzzy CA is also available in the package e1071. The package mclust02 provides a variety of functions for model-based CA. The approaches to clustering designed by archaeometricians, discussed in the section on model-based clustering, are not immediately available, but could be programmed. I am not aware of R packages for Bayesian CA. Other R packages to do cluster analysis are available, some of which implement quite new methodology.

Example

To illustrate some of the ideas discussed above, a sample of fragments from 34 Romano-British cast glass bowls, measured with respect to 11 oxides, will be used. The data are given in Baxter et al. (2005: 64). Oxides will be referred to by their associated element, and Si, obtained by differencing in the original paper, is not used here. Numbering is from 1-34, rather than by the identifications used in the original paper.

Initial, univariate, data inspection is illustrated in Figures 6 and 7, which are dotplots for Fe and Na. There is some suggestion of grouping in Figure 6, indicating there may be clusters in the data. Grouping is possibly suggested in Figure 7 but is less obvious. Case 9, to the extreme left, appears to be an outlier, but is not so extreme as to warrant immediate exclusion from the analysis. One other case, 5, was a similar kind of outlier for K and Al, but is also retained.

Figure 6: A dot-plot of Fe for the cast-bowl compositional data.


Figure 7: A dot-plot of Na for the cast-bowl compositional data.

Inspection of other dot-plots showed that Mn, P, Pb and Ti took on few values with no evidence of clustering, so these were omitted from further analyses. This leaves seven variables, Al, Ca, Fe, K, Mg, Na and Sb. A pairs plot (also called a draftsman plot, or scatterplot matrix) is shown for these variables in Figure 8. Such plots show all possible bivariate plots for the variables selected, the upper triangle of plots being the same as the lower triangle, except that axes are reversed. There is a suggestion of grouping in several of these plots, particularly those involving Fe, and a number of outliers are evident.


Figure 8: A pairs plot for the seven variables used for the cluster analysis of the cast-bowl compositional data.


Figure 9: Ward’s method (top) and average-linkage dendrograms from the analysis of the standardized cast-bowl data, using Euclidean distance as a measure of dissimilarity.


In Figure 9 the dendrograms from some initial cluster analyses are shown, Ward’s method for standardized data at the top, and average linkage beneath. Euclidean distance was used as the measure of dissimilarity. Ward’s method suggests, most clearly, two clusters, but could be cut to give three or four apparently distinct groups. We have seen how this can mislead.

For the two cluster solution, that to the left consists of cases (4, 7, 11, 14, 17, 25, 29, 30, 31, 33, 34), that is, 11 cases in all. Call this ‘cluster 1’. If the average linkage dendrogram is cut to give two clusters, the same cluster is obtained with the addition of cases 3 and 26 that appeared to the extreme left of the Ward’s method dendrogram. These, and case 14, seem somewhat outlying relative to other cases in the cluster.

If the upper dendrogram in Figure 9 is cut to give four clusters, careful inspection of the lower dendrogram shows that, other than cluster 1, these are not closely matched, though some groups of cases such as (16, 18, 22, 27) do group similarly in both analyses. It is usually easier to do this sort of comparison by labelling cases according to the cluster membership suggested by the first analysis rather than by case number.

These initial analyses suggest that there are, perhaps, two main groups in the data, with a number of outliers (or not easily clustered cases), indicated in the average linkage analysis. Analysis could proceed in various ways at this point – for example, by labelling points on the pairs plot by cluster membership, for G = 2, 3, 4. This approach will be illustrated shortly, using principal components rather than the original variables, but first some k-means analysis is undertaken.

Using the kmeans function from R, for G=2, gave clusters of size 13 and 21, the smaller of these being cluster 1 plus (3, 26). Clustering was initiated using random starts, but these did not affect the results. For G = 3 and G = 4 the random starts did affect results. For 100 random starts the best results produced clusters of size 13, 11 and 10 for G = 3, the first of these being identical with the cluster of 13 for G = 2. For G = 4 the same procedure produced clusters of size 11, 2, 10 and 11, with the first two of these splitting the previous clusters of size 13 into cluster 1 and (3, 26). Carrying out a PCA and looking at the pairs plot for the first three PCs, labelled by the four-cluster solution, produces Figure 10.
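Calls of the following kind produced the k-means results just described (bowls is a hypothetical name for the 34 × 7 standardized matrix; exact cluster sizes can depend on the random starts):

    km4 <- kmeans(bowls, centers = 4, nstart = 100)   # best of 100 random starts
    km4$size                                          # cluster sizes (11, 2, 10, 11 reported above)
    pca <- prcomp(bowls)
    pairs(pca$x[, 1:3], col = km4$cluster)            # Figure 10-style plot, labelled by cluster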


Figure 10: A pairs plot, based on the first three principal components of the standardized glass bowl data, labelled by the cluster membership derived from a k-means CA with four clusters.

Cluster 1 is associated with the crosses and (3, 26) with the triangles. The cases in cluster 1 plot tightly together with the exception of an outlier (relative to the cluster) evident on the third component. This is case 14, which was suggested as unusual in the average-link analysis. Case 14 has the highest values of Ca and Sb in the total sample, and the lowest value of K within cluster 1. The affinity of (3, 26) with each other and cluster 1 is evident, but they plot slightly separately on the plot for the first two components. There is little evidence that the remaining two clusters are genuinely separate.

Using k-medoids clustering gives the same results as k-means for G = 2 and 3; for G = 4, case 26 is separated from case 3 and cluster 1, but a silhouette plot suggests that it is probably not a member of the cluster to which it is assigned. A silhouette plot for G = 3 suggests that cases 3, 14, and 26 lie between clusters, rather than belonging securely to the group they are assigned to.


Fuzzy CA, implemented using the fanny function with m = 2 in the cluster package, produced a different crisp clustering for G = 2 from those previously obtained, adding case 12 to cluster 1 and (3, 26). Its membership coefficient was, however, only 0.52 (contrasting with 0.57 for cases 3 and 26, and 0.59 for case 14). Using G = 3 removed case 12 from the cluster.

The overall picture is that there is one fairly tight cluster of 10 cases in the data, and another more diffuse one, with (3, 26) also forming a small group having affinities with the tight cluster. A number of cases, including 14, are not readily assigned to either of the main clusters.

This example has been designed to illustrate some of the ideas mentioned in the text. Rather than relying on a single method, the aim has been to show how the simpler methods can be used to explore the data in a relatively informal way. The more complex methods of CA have not been illustrated, but pose similar problems in assessing the number and validity of clusters. The data set used is a relatively small one. Larger data sets, or data with less clear structure, pose more problems but, with care, the ideas used here can be applied.

Conclusion

Anyone reading this paper with little experience with CA, who thinks it might be useful for their data, is advised to start with the simpler hierarchical techniques. For more experienced, or adventurous, users there are several avenues that could be explored.

It would be useful, for example, to have a systematic investigation of how fuzzy CA or robust methods of CA perform across a range of archaeometric data sets. Similar comments apply to some of the more complex model-based methods, though I suspect sample and cluster size will be problematic.

Cluster analysis has been the most widely used method of multivariate analysis in archaeology. It has not always been used well or interestingly, and some researchers now treat it as a starting point, rather than end-point, of their statistical analysis. The widespread use of CA nevertheless presumably reflects the fact that archaeologists and archaeometricians have found it useful. Greater awareness of the potential problems in applying CA, and in some cases better reporting of the results, would be desirable, but CA can be a useful tool and is likely to remain a commonly applied technique.

References

Aitchison, J., Barceló-Vidal, C., and Pawlowsky-Glahn, V. (2002) Some comments on compositional data analysis in archaeometry, in particular the fallacies in Tangri and Wright's dismissal of logratio analysis, Archaeometry, vol.44, 295-304.


Banfield, J.D. and Raftery, A.E. (1993) Model-based Gaussian and non-Gaussian clustering. Biometrics, vol. 49, 803-821.

Baxter, M.J. (1994) Exploratory multivariate analysis in archaeology, Edinburgh University Press, Edinburgh.

Baxter, M.J. (1999) Detecting multivariate outliers in artefact compositional data. Archaeometry, vol. 41, 321-338.

Baxter, M.J. (2001a) Statistical modelling of artefact compositional data. Archaeometry, vol. 43, 131-147.

Baxter, M.J. (2001b) Multivariate analysis in archaeology. In Handbook of archaeological sciences, D.R. Brothwell and A.M. Pollard (eds.), Wiley, Chichester, 685-694.

Baxter, M.J. (2003) Statistics in archaeology, Arnold, London.

Baxter, M.J. (2006) Supervised and unsupervised pattern recognition in archaeometry. Archaeometry, vol. 48, 671-694.

Baxter, M.J. and Buck, C.E. (2000) Data handling and statistical analysis. In Modern analytical methods in art and archaeology, E. Ciliberto and G. Spoto (eds.), Wiley, New York, 681-746.

Baxter, M.J. and Freestone, I.C. (2006) Log-ratio compositional data analysis in archaeometry. Archaeometry, vol. 48, 511-531.

Baxter, M.J., Cool, H.E.M. and Jackson, C.M. (2005) Further studies in the compositional analysis of colourless Romano-British vessel glass. Archaeometry, vol. 47, 47-68.

Baxter, M.J. and Jackson, C.M. (2001) Variable selection in artefact compositional studies. Archaeometry, vol. 43, 253-268.

Beier, T. and Mommsen, H. (1994) Modified Mahalanobis filters for grouping pottery by chemical composition. Archaeometry, vol. 36, 287-306.

Bieber, A.M., Brooks, D.W., Harbottle, G., and Sayre, E.V. (1976) Application of multivariate techniques to analytical data on Aegean ceramics. Archaeometry, vol. 18, 59-74.

Blomster, J.P., Neff, H. and Glascock, M.D. (2005) Olmec pottery production and export in ancient Mexico determined through elemental analysis. Science, vol. 307, 1068-1072.

Buck, C.E., Cavanagh, W.G. and Litton, C.D. (1996) Bayesian approach to interpreting archaeological data, Wiley, Chichester.


Cox, G.A. and Gillies, K.J.S. (1986) The X-ray fluorescence analysis of medieval durable blue soda glass from York Minster. Archaeometry, vol. 28, 57-68.

Dalgaard, P. (2002) Introductory statistics with R, Springer, New York.

Dellaportas, P. and Papageorgiou, I. (2006) Multivariate mixtures of normals with unknown number of components. Statistics and Computing, vol. 16, 57-68.

Everitt, B.S. and Dunn, G. (2001) Applied multivariate data analysis, 2nd edition, Arnold, London.

Everitt, B.S., Landau, S., and Leese, M. (2001) Cluster analysis, 4th edition, Arnold, London.

Fraley, C. and Raftery, A.E. (2007) Model-based methods of classification: using the mclust software in chemometrics, Journal of Statistical Software, vol. 18, Issue 6.

Friedman, J.H. and Meulman, J.J. (2004) Clustering objects on subsets of attributes (with discussion). Journal of the Royal Statistical Society B, vol. 66, 815-849.

Glascock, M.D. (1992) Characterization of archaeological ceramics at MURR by neutron activation analysis and multivariate statistics. In Chemical characterization of ceramic pastes in archaeology, H. Neff (ed.), Prehistory Press, Madison, WI, 11-26.

Gordon, A.D. (1999) Classification, 2nd edition, Chapman and Hall/CRC, London.

Hall, M.E. (2004) Pottery production during the Late Jomon period: insights from the chemical analyses of Kasori B pottery. Journal of Archaeological Science, vol. 31, 1439-1450.

Hall, M.E. and Minyaev, S. (2002) Chemical analyses of Xiong-nu pottery: a preliminary study of exchange and trade on the Inner Asian steppes. Journal of Archaeological Science, vol. 29, 135-144.

Harbottle, G. (1976) Activation analysis in archaeology. Radiochemistry, vol. 3, 33-72.

Hastie, T., Tibshirani, R., and Friedman, J. (2001) The elements of statistical learning, Springer, New York.

Kaufman, L. and Rousseeuw, P.J. (1990) Finding groups in data, Wiley, New York.

Krzanowski, W.J. (2000) Principles of multivariate analysis, 2nd edition, Oxford University Press, Oxford.

Krzanowski, W.J. and Marriott, F.H.C. (1995) Multivariate analysis: classification, covariance structures and repeated measurements, Edward Arnold, London.


Manly, B.F.J. (2004) Multivariate statistical methods: a primer, 3rd edition, Chapman and Hall/CRC, Boca Raton, FL.

Neff, H. (2002) Quantitative techniques for analyzing ceramic compositional data. In Source determination by INAA and complementary mineralogical investigations, D.W. Glowacki and H.Neff (eds.), Monograph 44, The Cotsen Institute of Archaeology at UCLA, Los Angeles, 15-36.

Papageorgiou, I. and Lyritzis, I. (2007) Multivariate mixture of normals with unknown number of components: an application to cluster Neolithic ceramics from Aegean and Asia Minor. Archaeometry, vol. 49, 795-813.

Papageorgiou, I., Baxter, M. and Cau, M.A. (2001) Model-based cluster analysis of artefact compositional data. Archaeometry, vol. 43, 571-588.

Pollard, A.M. (1986) Multivariate methods of data analysis. In Greek and Cypriot pottery: a review of scientific studies, R.E. Jones (ed.), British School at Athens Fitch Laboratory Occasional Paper 1, Athens, 56-83.

Ripley, B.D. (1996) Pattern recognition and neural networks, Cambridge University Press, Cambridge.

Seber, G.A.F. (1984) Multivariate observations, Wiley, New York.

Shennan, S. (1997) Quantifying archaeology, 2nd edition, Edinburgh University Press, Edinburgh.

Struyf, A., Hubert, M. and Rousseeuw, P.J. (1996) Clustering in an object-oriented environment. Journal of Statistical Software, vol. 1, 1-30.

Venables, W.N. and Ripley, B.D. (2002) Modern applied statistics with S, 4th edition, Springer, New York.

Webb, A.R. (2002) Statistical pattern recognition, 2nd edition, Wiley, New York.

Wilson, A.L. (1978) Elemental analysis of pottery in the study of its provenance - a review. Journal of Archaeological Science, vol. 5, 219-236.

Appendix

Fuzzy CA

As with k-means cluster analysis, in fuzzy clustering an objective function is minimized. One possibility is to minimize

$$\sum_{i=1}^{n} \sum_{g=1}^{G} u_{ig}^{m} d_{ig}^{2}$$


where $u_{ig}$ is the membership of case i in group g, with $u_{ig} \geq 0$ and $\sum_{g=1}^{G} u_{ig} = 1$; m is a 'fuzzification' factor; and $d_{ig}$ is the Euclidean distance of case i to the centroid of group g. If m = 1 a crisp clustering, equivalent to a non-fuzzy k-means clustering, is obtained; as m increases the classification becomes increasingly fuzzy, a totally fuzzy classification being one in which all memberships equal 1/G. The choice m = 2 is common.
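For concreteness, the objective can be evaluated directly in R. The sketch below takes a data matrix X, an n × G membership matrix u, and the factor m (all hypothetical names), with centroids computed as the u^m-weighted means, as in the standard fuzzy k-means formulation; it illustrates the criterion above rather than the dissimilarity-based criterion actually minimized by fanny.

fuzzy_objective <- function(X, u, m = 2) {
  w <- u^m                                   # weighted memberships, n x G
  centroids <- t(w) %*% X / colSums(w)       # G x p weighted centroids
  d2 <- sapply(seq_len(ncol(u)), function(g)
    rowSums(sweep(X, 2, centroids[g, ])^2))  # squared distances to centroid g
  sum(w * d2)                                # value of the objective
}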

Mixture models

The following short account is based on Papageorgiou et al. (2001). The observed data are $\mathbf{x} = (\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n)$, where $\mathbf{x}_i$ is a p-valued vector. If $\mathbf{x}_i$ is selected from the gth component of the mixture it is assumed to have probability density $f_g(\mathbf{x}_i \mid \boldsymbol{\mu}_g, \boldsymbol{\Sigma}_g)$, where

$$f_g(\mathbf{x}_i \mid \boldsymbol{\mu}_g, \boldsymbol{\Sigma}_g) = (2\pi)^{-p/2} |\boldsymbol{\Sigma}_g|^{-1/2} \exp\left\{-\frac{1}{2}(\mathbf{x}_i - \boldsymbol{\mu}_g)^T \boldsymbol{\Sigma}_g^{-1} (\mathbf{x}_i - \boldsymbol{\mu}_g)\right\}$$

and $\boldsymbol{\mu}_g$ and $\boldsymbol{\Sigma}_g$ are the mean and covariance matrix of the gth component.

In the mixture maximum-likelihood approach the likelihood maximized is

$$L = \prod_{i=1}^{n} \sum_{g=1}^{G} \pi_g f_g(\mathbf{x}_i \mid \boldsymbol{\mu}_g, \boldsymbol{\Sigma}_g)$$

where $\pi_g$ is the mixing proportion for the gth component, with $\sum_{g=1}^{G} \pi_g = 1$. Unless constraints are placed on the parameters, there are $(G - 1) + Gp + Gp(p+1)/2$ parameters to estimate. For many problems the number of parameters to estimate will exceed the sample size, so that constraints must be imposed. Most usually the constraint that the covariance matrices, $\boldsymbol{\Sigma}_g$, are the same is used.

In the classification maximum-likelihood approach let $\gamma_i = g$ if $\mathbf{x}_i$ belongs to the gth component. Initially the values of $\gamma_i$ are unknown. The likelihood for the data can be written in the form

$$L_C = \prod_{i=1}^{n} f_{\gamma_i}(\mathbf{x}_i \mid \boldsymbol{\mu}_{\gamma_i}, \boldsymbol{\Sigma}_{\gamma_i})$$

and the labels, $\gamma_i$, as well as $\boldsymbol{\mu}_g$ and $\boldsymbol{\Sigma}_g$, must be estimated. This results in a direct clustering of the observations. It is usually impractical to maximise the above likelihood without some constraints on the parameters.
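In practice such models can be fitted with the mclust software described by Fraley and Raftery (2007). A minimal sketch in R, assuming the standardized data are in a matrix X (hypothetical name):

library(mclust)
mc <- Mclust(X, G = 1:4)            # mixtures with 1 to 4 components, chosen by BIC
summary(mc)                         # selected model and cluster sizes
plot(mc, what = "classification")   # cases labelled by estimated component

Constrained covariance structures of the kind just described correspond to particular model names in mclust; equal covariance matrices, for example, is the "EEE" model.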
