DrennanStats in Archeology

  This article was originally published in the Encyclopedia of Archaeology, published by Elsevier, and the attached copy is provided by Elsevier for the author's benefit and for the benefit of the author's institution, for non-commercial research and educational use including use in instruction at your institution, posting on a secure network (not accessible to the public) within your institution, and providing a copy to your institution's administrator. All other uses, reproduction and distribution, including without limitation commercial reprints, selling or licensing copies or access, or posting on open internet sites are prohibited. For exceptions, permission may be sought for such use through Elsevier's permissions site at: http://www.elsevier.com/locate/permissionusematerial

  Drennan Robert D, STATISTICS IN ARCHAEOLOGY. In: Encyclopedia of Archaeology, ed. by Deborah M. Pearsall. © 2008, Academic Press, New York.


    STATISTICS IN ARCHAEOLOGY

    Robert D Drennan, University of Pittsburgh, Pittsburgh, PA, USA

    © 2008 Elsevier Inc. All rights reserved.

    Encyclopedia of Archaeology (2008), vol. 3, pp. 2093-2100

    quantitative analysis: Analysis of information that comes in the form of numbers; relies heavily on the tools of statistics.

    sampling bias: Selection of a sample in such a way that some members of the population are less likely to be included than others.

    vagaries of sampling: The variation that can be expected to occur by pure random chance between different samples selected from the same population.

    Archaeological data are irrevocably (although not exclusively) quantitative in nature. The phenomena that archaeologists work with (including artifacts, ecofacts, features, remains of architecture, and many more) are classified, counted, and measured in various ways. The results are numbers, often quite a lot of numbers. Describing things quantitatively, then, along with finding patterns in and comparing numbers, are essential archaeological tasks. Statistical analysis is especially associated with certain schools of thought in archaeology; for example, it was strongly advocated as processual archaeology developed (see Processual Archaeology). Counting and measuring different kinds of things found in the archaeological record, however, are so fundamental to the process of using material remains to reconstruct past human activities, that statistical analysis cannot be ignored by any school of thought in which it matters to know what people did in the past. The importance of statistical analysis in archaeology is evidenced by the number of books published in recent years with the purpose of introducing and explaining the tools of statistical analysis in a specifically archaeological context. Any of these works can be consulted for further discussion of aspects of statistical analysis touched on here. Traditional statistical tools and terminology developed between about 1920 and 1950 are most often used in archaeology, although the perspectives and vocabulary of the more recent exploratory data analysis school are becoming more widely known.

    Description

    Categories

    The basic tools of descriptive statistics are really quite simple and well known. Classification (especially of artifacts, but also of other things) has long been a quintessential archaeological activity. When things are classified, and how many of them there are in the different categories is determined, the results are one of the two basic kinds of numbers archaeologists work with: counts (as summaries of categorical variables). Counts are often usefully expressed as proportions, or percentages, especially for comparisons as, for example, when the proportions of different ceramic types recovered from different sites are compared with each other. It is proportions that make it possible to compare two or more sets that contain different numbers of things. Thus, for example, flakes are more strongly represented in an excavation unit where they comprise 25% of the artifacts recovered than in a unit where they comprise only 15%, even if the 25% in the former unit consists of only 10 flakes (out of 40 artifacts) and the 15% in the latter consists of 30 flakes (out of 200 artifacts). This commonsense use of percentages, familiar from elementary school, is sometimes referred to as standardizing for different-sized collections, or samples, although it is probably better to reserve the term standardize for a more precise and specific statistical use.

    Proportions are commonly represented graphically in many different ways, some much more effective than others. Bar graphs are perhaps the simplest and clearest of such graphics. The ease with which commonly available computer programs can be used to draw very complicated forms of bar graphs, often with showy three-dimensional effects, sometimes undermines the utility of this and other graphic techniques. Figure 1 illustrates good and bad approaches to representing a relatively simple set of percentages graphically for comparative purposes.

    Measurements

    The other, fundamentally different, kind of number archaeologists often work with (in addition to counts) is measurements, such as length, width, height, thickness, area, weight, etc. Measurements are made in defined units along scales that are, in principle at least, infinitely subdivisible. Sets of measurements, often called batches, have several properties it is important to be aware of when engaging in fundamental description. In the first place, measurements of a single kind of thing usually tend to bunch up around a clear central point along the measurement scale. In what is called a normal shape, there is a single central bunch of numbers with a symmetrical shape or distribution tapering off on either side of the central point when displayed as a stem-and-leaf plot or as a histogram, the two most common ways to represent the shape of a batch of measurements graphically (Figures 2 and 3).

    It is useful for basic description, and fundamental for further statistical analysis, to have a single index of the center (or measure of central tendency) for a batch of measurements. The mean, or average, which is as familiar as percentages, is most often used. If the batch of measurements has outliers, as in Figure 3, the mean may not provide a very sensible index to represent the center. In such cases, the median (essentially the middle number in the batch) may be a more useful index. Similarly, if the batch has a very asymmetrical shape (is skewed), as in Figure 3, the mean

    may not be a sensible index of its center. The median may, again, be more useful, or, for statistical analyses that depend on the special properties of the mean, the batch can be transformed to make its shape more symmetrical, for example, by using the squares or square roots or logarithms of all the numbers instead of the original measurements. Occasionally, a batch of measurements may show two or more clear bunches of numbers, as in Figure 4. Once again, simply calculating the mean of such a batch produces nonsensical results. The batch must be separated into two or more sub-batches, because the presence of multiple clear bunches of numbers is an unmistakable indication that more than a single kind of thing is represented. Exploration of the shape of a batch of measurements is an essential first step in almost any statistical analysis of it, so as to be sure that a sensible and relevant index of its center can be identified.

    Figure 1 Good and bad ways to illustrate the proportions of artifact types for comparison of the assemblages from different sites. Stacked bars at the upper left are confusing and do not make it easy to recognize patterns. A bar chart in three dimensions at the upper right is almost entirely undecipherable. It can be difficult to resist the temptation to use such glamorous graphics, but simple, flat bar charts like those at the bottom are much more effective presentations of information. In these charts it is easy to recognize that sites A and C have similar assemblages, dominated by artifact types 5 and 6, while the artifact proportions at sites B and D differ sharply, both from sites A and C and from each other.

    Inferences from Samples

    Beyond simply exploring and describing the patterns observed in batches of measurements or counts of

    categories, and perhaps comparing them with other batches, looms the issue of using a batch (or sample) to characterize a larger set (or population) of things not available for study. Sometimes archaeologists consciously select samples from larger populations, as when they choose some squares in a site grid for excavation or some artifacts for raw material sourcing from those recovered in an excavation. Sometimes the process is less obvious, as when an archaeologist describes, say, Formative period flaked stone tools. In this case, the population is very large and vaguely defined (all the flaked stone tools made during the Formative period in some region), and it is characterized on the basis of those that happen to have been recovered and are thus available for observation. Even if 100% of the flaked stone tools that have been recovered are studied, it is still a question of samples and populations, since the description is ordinarily taken to apply to all Formative flaked stone tools. Many archaeologists do not recognize this, and for this and other reasons, the archaeological literature is rife with misunderstanding and misuse of statistical tools.

    Figure 2 The distribution of a batch of 128 measurements ranging from 5.7 to 75.7, displayed graphically in a stem-and-leaf plot (left) and a histogram (right). The mean of 39.1 provides a sensible index of the center of this batch because it falls at the middle of the main bunch of numbers. The shape is not perfectly symmetrical, and there is a small valley in the 25-30 interval. Such small valleys in a distribution of site sizes have sometimes been taken to indicate two different kinds of sites and thus settlement hierarchy, but such an interpretation is not warranted. Batches of measurements rarely conform perfectly to theoretical distributions, and this example is as close to a normal shape as can be expected in a sample of this size.

    Figure 3 When a batch has outliers (the high numbers scattered far from the other measurements in the histogram at left), the mean (24.1) may be strongly affected and no longer provide a good indication of the center of the main bunch of measurements. In such a case the median (15.4) is a better index. Similarly, when a batch is skewed (as in the histogram at right), the asymmetrical shape produces a mean (181.4) that does not indicate the center of the main bunch of numbers, and the median (182.4) is not much better. A transformation is required (see text).

    Figure 4 A batch of measurements whose distribution shows, in the form of two clear peaks or bunches of values, that fundamentally two different kinds of things are included. It makes no sense to calculate the overall mean (59.2) of such a batch because it indicates, not the center of anything, but rather the valley between the two peaks. Such a bimodal batch must be split into two separate parts for analysis. When this is done, using a measurement of 55 as a dividing point (dashed line), the shapes of the two resulting batches are single peaked and symmetrical enough that their means (45.7 and 65.3, respectively) provide a good index of each of the two centers. While it might be tempting to separate out a third batch below about 38 and a fourth above about 73, such minor deviations from a normal shape are to be expected as a consequence of random variation and are not meaningful.

    Sampling Bias

    Whenever statements are made about a population on the basis of a sample, there is at least some risk of error. The tools of inferential statistics cannot eliminate this risk, but they provide powerful ways of assessing it and working with it. Practically every analysis in archaeology (whether statistical or not) involves characterizing a larger set of things than are actually observed, so the perspectives of inferential statistics have implications for archaeological analyses that reach far beyond the quantitative contexts they were designed for. The risk of error in characterizing a population on the basis of a sample arises from two fundamentally different sources. One is that the process of selecting the sample from the population may systematically produce a sample with characteristics different from those of the population at large. This is known as sampling bias. It happens, for example, when lithic debitage is recovered by passing excavated deposits through screens with 6 mm mesh. Lithic debitage is known to include very small waste flakes, many of which will pass through mesh of this size, so the flakes in the sample recovered from the screen will be systematically larger than those in the complete population. The mean weight of such a sample of waste flakes would be higher than that of the population as a whole, and any statement made on the basis of this sample about the weight of waste flakes in general would be inflated as a direct consequence of this sampling bias.
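    The screening example can be made concrete with a small simulation. This is only a sketch under stated assumptions: the lognormal size distribution, the size-to-weight relation, and the names simulate_flakes and screen are all hypothetical and are not taken from the article. Only the 6 mm mesh comes from the text.

```python
import random
import statistics

MESH_MM = 6.0  # screen mesh size mentioned in the text


def simulate_flakes(n, seed=0):
    """Return (size_mm, weight_g) pairs for a hypothetical debitage population."""
    rng = random.Random(seed)
    flakes = []
    for _ in range(n):
        size_mm = rng.lognormvariate(2.0, 0.6)  # assumed size distribution
        weight_g = 0.002 * size_mm ** 2.5       # assumed size-to-weight relation
        flakes.append((size_mm, weight_g))
    return flakes


def screen(flakes, mesh_mm=MESH_MM):
    """Keep only the flakes too large to pass through the mesh."""
    return [f for f in flakes if f[0] >= mesh_mm]


population = simulate_flakes(10_000)
recovered = screen(population)

pop_mean = statistics.mean(w for _, w in population)
rec_mean = statistics.mean(w for _, w in recovered)

print(f"flakes in population:    {len(population)}")
print(f"flakes recovered:        {len(recovered)}")
print(f"mean weight, population: {pop_mean:.3f} g")
print(f"mean weight, recovered:  {rec_mean:.3f} g")
```

    Because every flake lost through the screen is lighter than every flake retained, the recovered mean is necessarily higher than the population mean, whatever size distribution is assumed. No calculation on the recovered flakes alone can reveal how large the inflation is.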

    Precisely the same is true even in entirely nonquantitative analyses. An archaeologist might subjectively characterize the lithic technology of the Archaic period in some region as reflecting a broad application of a high degree of technical skill. If this characterization of the large, vaguely defined population consisting of all lithic artifacts produced in the region during the Archaic period is based on a sample recovered by artifact collectors who value well-made bifacial tools and never bother to keep anything as mundane as a utilized flake, then the sample is clearly biased toward well-made tools, and the breadth of application of high technical skill will be overestimated as a direct consequence of sampling bias.

    Rigorously random procedures for sample selection are designed to avoid bias, and neither of the sampling procedures in the examples above is random. There are no statistical tools for removing bias from a sample once it has been selected, whether by screening deposits through large mesh or by collectors. Indeed, the tools of inferential statistics are often said to assume that samples are unbiased, and thus to be inapplicable to the analysis of biased samples. This is not a useful way to approach statistical analysis in archaeology, because archaeologists are often forced to work with samples they know to be biased. The prescription offered in other disciplines (collect another sample with rigorously random procedures that avoid bias) is often impossible in archaeology. Fortunately, there are at least two common ways of working with biased samples. Archaeologists may want to say things about populations that would not be affected by the bias present in the available sample. It might, for example, be perfectly possible to make unbiased conclusions about the proportions of raw materials represented in lithic debitage on the basis of the screened sample discussed above. It would only be necessary to assume that different raw materials would not occur among the small waste flakes that fell through the screen in proportions very different from those among the larger flakes that would not escape the sample in this way. Biased samples can also often be accurately compared to each other if the same bias operated to the same degree in the selection of all samples to be compared. Thus, two samples of lithics recovered from 6 mm screen may show rather different flake weights. Such a difference cannot be attributed to sampling biases if the sampling biases were the same in both cases, and it is valid to say that the flakes in one sample are heavier than those in the other.

    Precisely the same principles apply to subjective and qualitative comparisons. To compare the application of technical flint-knapping skill to the production of lithic tools in the Archaic with that in the Formative, it may well be possible to work successfully with biased samples. (This is likely to be necessary in any event.) As long as the same sampling biases operated in the recovery of both Archaic and Formative artifacts, then they can be compared. If, however, the Formative tools come from systematic excavations and the Archaic ones are those accumulated by artifact collectors, then the sampling biases are different and likely to affect very strongly precisely the characteristics of interest. Such a difference in sampling biases affects any comparison based on the abundance of nice, well-made tools, whether the comparison is quantitative or not. To repeat, statistical tools cannot eliminate bias once a sample has been selected (and the utter elimination of all kinds of sampling bias from the process of archaeological recovery is an unrealistic goal in any event). Statistics, however, does provide two very useful things: first, a reminder of the important effects that sampling bias can have, and, second, some useful concepts for thinking about sampling bias and how serious a worry it is in the specific context of particular observations of potential interest.
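    The two ways of working with biased samples described above can be illustrated with a simulation. It is a sketch only: the two sites, the size distributions, the size-to-weight relation, the 30% obsidian share, and the helper names make_flakes, screen, mean_weight, and share_obsidian are all hypothetical. The one assumption that matters is the one named in the text, that raw material does not depend on flake size.

```python
import random
import statistics

MESH_MM = 6.0  # the same screen is assumed for every sample


def make_flakes(n, log_size, rng, p_obsidian=0.3):
    """Hypothetical flakes as (size_mm, weight_g, material) tuples.

    Material is assigned independently of size, which is the assumption
    needed for raw material proportions to survive screening unbiased.
    """
    flakes = []
    for _ in range(n):
        size_mm = rng.lognormvariate(log_size, 0.6)  # assumed size distribution
        weight_g = 0.002 * size_mm ** 2.5            # assumed size-to-weight relation
        material = "obsidian" if rng.random() < p_obsidian else "chert"
        flakes.append((size_mm, weight_g, material))
    return flakes


def screen(flakes, mesh_mm=MESH_MM):
    """Keep only flakes too large to fall through the mesh."""
    return [f for f in flakes if f[0] >= mesh_mm]


def mean_weight(flakes):
    return statistics.mean(f[1] for f in flakes)


def share_obsidian(flakes):
    return sum(f[2] == "obsidian" for f in flakes) / len(flakes)


rng = random.Random(1)
site_a = make_flakes(5_000, 2.0, rng)
site_b = make_flakes(5_000, 2.3, rng)  # site B flakes are truly larger and heavier
screened_a = screen(site_a)
screened_b = screen(site_b)

for name, full, kept in (("A", site_a, screened_a), ("B", site_b, screened_b)):
    print(f"site {name}: mean weight {mean_weight(full):.3f} g (all), "
          f"{mean_weight(kept):.3f} g (screened); "
          f"obsidian {share_obsidian(full):.1%} (all), "
          f"{share_obsidian(kept):.1%} (screened)")
```

    In this sketch both screened means are inflated, yet site B still comes out heavier than site A, and the obsidian share barely moves. If material did depend on size, or if the two sites had been screened with different mesh, neither conclusion would be safe.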

    Vagaries of Sampling

    Sampling bias, then, is one of the two principal sources of error in making conclusions about a population on the basis of a sample. The other is that samples selected entirely without bias still differ from each other and from the population they were selected from because of pure random chance. This is often referred to as the vagaries of sampling and is easily approached by imagining tossing a coin. When a coin is tossed honestly four times, the result is a completely unbiased sample of four from the infinitely large population composed of all the times the coin could be tossed. Assuming no prior knowledge at all about the principles of coin tossing, this sample of four could be used to infer the proportions of heads and tails in the large population of all possible coin tosses. The inference made would not always be the same, because while common sense tells us that the proportion of heads in that large population is 50%, it also tells us that in any given sample of four, it might well not turn out exactly that way. Sometimes, in a sample of four coin tosses, the proportion of heads would be 50%, sometimes 25%, or 75%, and sometimes even 100% or 0%. An analyst with a sample of two heads and two tails would conclude that the proportion of heads among coin tosses in general was 50%. An analyst with a sample of one head and three tails, however, would have to conclude that the proportion of heads among coin tosses in general was 25%, and one with a sample of four tails would have to conclude that the proportion of heads among coin tosses in general was 0%. It is easy to see that an analyst making conclusions about a population on the basis of a sample from it will sometimes be right and sometimes wrong. This is true even if sampling bias can be completely ruled out (as it can in this hypothetical example); the erroneous conclusions are the result of pure blind luck: the completely random vagaries of sampling (see Sampling Methods, Theory and Praxis).

    It is the vagaries of sampling that the tools of inferential statistics deal with. They are derived ultimately from a consideration of the range of possible outcomes in selecting a sample of a given size from a given population. In the coin tossing example, the 16 possible (and equally probable) outcomes for a sample of four are easily enumerated: HHHH, HHHT, HHTH, HHTT, HTHH, HTHT, HTTH, HTTT, THHH, THHT, THTH, THTT, TTHH, TTHT, TTTH, TTTT. The correct outcome (50% heads) of a representative sample (one that represents its parent population accurately) is easily seen to occur six times; four times the proportion of heads is 1/4 or 25%; four times 3/4 or 75%; once 0/4 or 0%; and once 4/4 or 100%. Several of the important principles underlying inferential statistics are easily seen in the samples of four coin tosses. First, an unbiased sample gives an accurate answer more often than it gives any other single answer. Second, it is possible for a completely unbiased sample to represent its parent population inaccurately. Thus, contrary to common belief, random (unbiased) sample selection is no guarantee that a sample is representative. Third, a sample very strongly different from its parent population occurs less often than one fairly similar to the parent population. Thus, when reasoning from samples to populations, serious errors are less common than moderate errors.

    A final fundamental principle of inferential statistics is firmly embedded in common sense as well: larger samples are more reliable than smaller samples. The specific implications of this principle for the coin tossing example can be seen by expanding the thought experiment to samples of eight. That is, instead of imagining tossing the coin four times, we imagine tossing it eight times. For eight coin tosses, there are 256 equally probable different outcomes. The most seriously erroneous sample results for estimating the proportion of heads in coin tossing are less common than with a sample of four. Samples with a proportion of heads of 25% or less or 75% or more, for example, represent only 74/256 or 29% of the outcomes, compared to 10/16 or 63% for samples of four coin tosses. Larger samples, then, are more reliable, not because beyond some point they are guaranteed to give an absolutely accurate result, but rather because, as sample size increases, results will be fairly accurate a steadily growing proportion of the time. The effect of the vagaries of sampling, then, is reduced when sample size is increased. This effect depends entirely on the size of the sample; contrary to widespread opinion, it has nothing to do with the size of the population or the proportion of the population included in the sample.

    Estimates, Confidence, and Significance

    These principles are often summed up and expressed as statistical confidence. For example, an archaeologist recovers a sample of sherds from a site, and 35% of the sherds recovered are from serving bowls. If the sample is unbiased (i.e., if the recovery process would not systematically privilege collecting either bowl sherds or non-bowl sherds), then the best guess that can be made (the one most likely to be near the real population value) is that 35% of the sherds at the site are from bowls. Clearly, however, it is entirely possible that the vagaries of sampling could produce a sample of 35% bowls, even though the proportion in the parent population was different from that figure. Just how much risk there is of error in the estimate of the proportion of bowls in the ceramic assemblage as a whole can be expressed as an error range for a given confidence level: say, 35% ± 6% at the 95% confidence level. This means that one can be 95% confident that the proportion of bowl sherds in the ceramic assemblage as a whole is between 29% and 41%. Statistical confidence is not just a yes-or-no concept; it is a continuous scale. Choosing to speak at the 95% confidence level means being right 95% of the time (and, inevitably, wrong 5% of the time). Being 95% confident that a ceramic assemblage consists of 35% ± 6% bowls means there is a 5% chance that the ceramic assemblage actually consists of more than 41% bowls or less than 29% bowls. One could speak at a higher confidence level, say 99%, on the basis of the same sample, but only at the cost of less precision, reflected in a larger error range: with the same sample, one can be 99% confident that the proportion of bowl sherds in the assemblage is 35% ± 8%. Conversely, one can speak more precisely on the basis of the same sample, but only at the cost of statistical confidence: one could say that the proportion of bowl sherds in the assemblage is 35% ± 4%, but only with 80% confidence (i.e., a 20% chance of being wrong). Statistical confidence, then, takes the form of a statement (or estimate) about a population with an error range for a specified confidence level.

    Estimates like these can be compared graphically, as in Figure 5. Here the proportion of bowl sherds at a site, as estimated from the sample discussed above, is compared with the estimate for another site. The error ranges for different confidence levels are represented graphically, showing that either site's estimate falls outside even the 99% confidence level error range for the other. One can thus be more than 99% confident (based on the two samples) that the proportions of bowl sherds in the complete assemblages from the two sites do indeed differ. The 99% confidence level means that there is less than a 1% chance that the statement is actually wrong (i.e., that bowl sherd proportions are actually the same in the two complete site assemblages). Precisely this same issue can be discussed in terms of statistical significance, which is simply the mirror image of statistical confidence. The difference between these two ceramic assemblages could be said to be highly significant (p < 0.01), meaning that there is less than a 1% chance that two unbiased samples with such different bowl sherd proportions would be selected from populations that did not differ at all with regard to bowl sherd proportions. In sum, one would have very good reason to explore what the difference in bowl sherd proportions might mean, because it is extremely likely that the difference observed between the two samples does indeed reflect a difference between the two parent populations that are ultimately the objects of interest. That is to say, it is extremely unlikely that the difference between the two samples is attributable to just the vagaries of sampling. The two samples are of adequate size to make it possible to talk with high confidence about differences between the two sites in regard to bowl sherd proportions. If the samples had been smaller, the error ranges would have turned out to be larger for any given confidence level, and differences

    between the sites would have been identifiable only with less statistical confidence. The differences, in such a case, would be said to be less significant.

    It is, unfortunately, a common practice to report results with a significance probability (p) of greater than 0.05 as simply not significant and then ignore them. A p value greater than 0.05 simply indicates more than a 5% chance that the results observed are only a consequence of random processes operating in samples too small to detect the quantitative effect of interest. This is the same as saying one has something less than 95% confidence in the results. Clearly, results with very low significance levels (indicated by very high p values) do not merit much attention. On the other hand, p values modestly greater than 0.05 are decidedly worth knowing. A significance probability of 0.10 (twice the level often taken as the threshold of not significant) is equivalent to statistical confidence of 90%. While a higher confidence level would be desirable, results one can be only 90% confident of are definitely worth being aware of. Colloquial speech recognizes this by cautioning that some things must be taken with a grain of salt. Results we are only 90% confident of should, indeed, be taken with a grain of salt, but they should not be ignored; results we are only 80% confident of must be taken with an even larger grain of salt, but they, too, may well be worth attention; and so on. By the same token, it is important to remember that confidence in excess of 95% or even 99% is still not equivalent to certainty. Providing the actual confidence level (70%, 80%, 90%, 95%, 99%, etc.) of results is common practice. Providing the actual significance level (p = 0.30, 0.20, 0.10, 0.05, 0.01, etc.) is less common, but makes considerably more sense than just branding results as significant or not significant.

    Figure 5 A bullet graph comparing proportions of bowl sherds in samples from two sites. The proportions differ by 10 percentage points, and since each proportion falls outside even the 99% confidence level error range for the other, one can have greater than 99% confidence that bowl sherd proportions do differ in the two site assemblage populations the samples came from. Expressed in significance terms rather than confidence terms, the difference between bowl sherd proportions at the two sites is very significant (p < 0.01).

    Relationships between Variables

    Error ranges for specific confidence levels are attached to estimates made for populations on the basis of samples. These estimates can either be of proportions (corresponding to counts of categories of things, as in the examples above) or of means (if the observation of interest is a measurement). A completely different, but fully equivalent, way of talking about this task is in terms of relationships between variables. Comparing proportions of bowl sherds between two sites is equivalent to studying the relationship between two variables: site and vessel form. For each sherd, there are two pieces of information contained in two categorical variables. The categories for the variable site are A and B; the categories for the variable vessel form are bowl and non-bowl. It is usually easier to think about this

    information as proportions of bowl sherds at different sites, but the problem can also be formulated as one of the relationship between the variables site and vessel form. If there is a very strong relationship between the two variables, then one site will have a substantially higher proportion of bowl sherds than the other. The bigger the difference in proportions, the stronger the relationship between the two variables is said to be. When the samples involved are large enough to permit talking about the difference between the sites with high statistical confidence, then the relationship between the two variables is highly significant. Strength and confidence (or significance) are two related but quite different concepts.

    To repeat, in this example, strength has to do with how big the difference between the samples is; confidence (or significance) has to do with whether, given the strength of the difference, the samples are large enough to make it possible to speak of a difference with much confidence that it exists, not just between the samples, but also between the populations the samples come from. If tossing a coin twice turns up two heads, the result (100% heads) is very strongly different from our theoretical expectation of 50% heads. The result is not, however, very significant since it is quite likely that such a sample could be produced by nothing more than the vagaries of sampling. On the other hand, if we toss a coin 2000 times and turn up 75% heads, the result is not as strongly different from our theoretical expectation of 50% heads, but it is considerably more significant because it is very unlikely that just pure random luck would produce such a high percentage of heads in such a large sample.

    When examining the relationship between two variables, it is important to consider both the strength and the significance of the relationship. When the two variables are both categories of things which we count, a common approach to evaluating strength and significance is the chi-square test. It yields a significance probability (p), as discussed above, and various measures of strength, including Cramér's V and phi. Cramér's V provides a number on a scale from 0 to 1, where 0 means a relationship of no strength between the variables (i.e., no relationship) and 1 means the strongest possible relationship. In the example of Figure 5, where two sites have 35% and 45% bowl sherds, respectively, the value of Cramér's V is 0.1, indicating a strong enough relationship to be meaningful, even though it seems not very different from 0. A Cramér's V of 1 in this example would be produced only if one site had 100% bowl sherds and the other had none. Phi is the same as Cramér's V when there are only two categories for each variable, as in this example: two sites (A and B) and two vessel forms (bowls and non-bowls). When more than two categories are involved for one variable or the other, phi becomes an open-ended scale and is much more difficult to interpret in absolute terms.

    Different statistical tools are needed if the two

    variables under consideration include one categorical variable and one measurement, as, for example, comparing utilized flake lengths between two sites. In this case, the two pieces of information for each artifact are which site it came from (two categories) and how long it is (a measurement along a continuous scale). The strength and significance of any difference can be characterized and represented graphically as in Figure 5, except that the vertical scale, instead of proportion of bowl sherds, represents mean flake length. Alternatively, the strength and significance of the relationship between the two variables site and flake length can be evaluated with a t test. The t test yields a significance probability (p) and a measure of strength (t), on an open-ended scale. When the cat-

    rim diameter, an r value of 1 would indicate a perfect positive correlation: the larger the vessel, the thicker the wall. An r value of −1 would indicate a perfect negative correlation: the smaller the vessel, the thicker the wall. The square of r is often taken as an indication of the proportion of variation in one variable that is explained by the relationship with the other variable. If the correlation (r) between rim diameter and wall thickness were 0.85, this would be a strong positive correlation, in which 72% (r² = 0.72) of the variation in wall thickness is accounted for by rim diameter.
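    The arithmetic behind r and r² can be sketched in a few lines. The function `pearson_r` below is an illustrative helper that computes the correlation coefficient directly from its definition (it does not fit a full regression), and the rim diameter and wall thickness values are invented purely for the example; they are not measurements from any assemblage.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2 values)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-products and sums of squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable does not vary")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical measurements, invented for illustration only.
rim_diameter_cm = [12.0, 15.0, 18.0, 22.0, 26.0, 30.0]
wall_thickness_mm = [4.1, 4.8, 5.0, 6.2, 6.1, 7.4]

r = pearson_r(rim_diameter_cm, wall_thickness_mm)
print(f"r = {r:.2f}, r squared = {r * r:.2f}")

# The figure quoted in the text: squaring r = 0.85 gives about 0.72.
print(f"0.85 squared = {0.85 ** 2:.2f}")
```

    Squaring r = 0.85 gives 0.7225, which is the 72% of variation quoted above.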

    See also: Archaeology Laboratory, Overview; Processual Archaeology; Sampling Methods, Theory and Praxis.

    Further Reading


    egorical variable consists of more than two categories, an analysis of variance (ANOVA) can be used to produce the usual significance probability (p) and a measure of strength (F), on an open-ended scale.

    If the two variables are both measurements, no categories are involved, and the problem cannot be formulated as a comparison between categories. The most powerful statistical approach is regression analysis, which again produces a significance probability (p) and, as a measure of strength, the correlation coefficient (r). The scale of r ranges from 0 to 1 (in both positive and negative directions), with 0 indicating a relationship of no strength and 1 (or −1) indicating the strongest possible relationship. If the two variables were ceramic vessel wall thickness and


    Aldenderfer MS (ed.) (1987) Quantitative Research in Archaeology: Progress and Prospects. Newbury Park, CA: Sage Publications.

    Baxter MJ (1994) Exploratory Multivariate Analysis in Archaeology. Edinburgh: Edinburgh University Press.

    Cowgill GL (1977) The trouble with significance tests and what we can do about it. American Antiquity 42: 350–368.

    Drennan RD (1996) Statistics for Archaeologists: A Commonsense Approach. New York: Plenum Press.

    Fletcher M and Lock GR (1991) Digging Numbers: Elementary Statistics for Archaeologists. Oxford: Oxford University Committee for Archaeology.

    Shennan S (1997) Quantifying Archaeology, 2nd edn. Edinburgh: Edinburgh University Press.

    Thomas DH (1986) Refiguring Anthropology: First Principles of Probability and Statistics. Prospect Heights, IL: Waveland Press.

    Wilkinson L (2000) Cognitive science and graphic design. In: Systat 10 Graphics, pp. 1–18. Chicago: SPSS, Inc.
