Computing in Archaeology Basic Statistics Week 8 (25/04/07) © Richard Haddlesey .

Post on 27-Mar-2015


Computing in Archaeology

Basic Statistics

Week 8 (25/04/07) © Richard Haddlesey www.medievalarchitecture.net

Aims

To familiarise ourselves with KEY statistical terms and their meanings

To understand the use of stats in archaeology

To assign variables appropriate levels of measurement at the recording level

Key texts

Basic Stats

Batch: Post holes

Variables: Length, area, diameter

Case: Post hole ID
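As a rough illustration of these three terms, the sketch below treats a batch as a list of cases, each case being one post hole with its variables. The class name `PostHole`, the field names, the units and all the measurements are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class PostHole:
    """One case: a single post hole and its recorded variables."""
    post_hole_id: str   # identifies the case
    length_cm: float    # variable (units are an assumption)
    area_cm2: float     # variable
    diameter_cm: float  # variable


# The batch is the whole collection of cases (all values are made up).
batch = [
    PostHole("PH001", 32.0, 780.0, 31.5),
    PostHole("PH002", 28.5, 610.0, 27.9),
    PostHole("PH003", 40.0, 1210.0, 39.2),
]

# One variable read across the whole batch: every diameter.
diameters = [case.diameter_cm for case in batch]
print(len(batch), "cases; diameters:", diameters)
```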

Variables

Variables are measured according to one of FOUR levels

1. Nominal = arbitrary name

2. Ordinal = sequence with no distance

3. Interval = sequence with fixed distance

4. Ratio = sequence with a fixed datum

Vince NOIR

Nominal, Ordinal, Interval, Ratio

Nominal examples

Condition, Age, Diameter, Length, Context, Period

Ordinal examples

Condition

1. Excellent

2. Good

3. Fair

4. Poor

Here “2” lies between “1” and “3”, but the distances between them are unlikely to be equal

Interval examples

Period

1. Late Bronze (1200-650)

2. Early Iron (649-100)

3. Late Iron (100+)

Here, if we have 3 artefacts (a, b and c) dated 150BC, 300BC and 450BC, although b may be an equal distance from a and c, c is not twice as old as a.

This is because there is no datum.

Ratio examples

Age instead of period
• 1000 ya is twice 500 ya
• 20kg is twice 10kg

Ratio is the highest level of measurement because it has a datum
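The difference between interval and ratio can be checked with the three dates from the interval example. The sketch below is only an illustration: the helper `age_in_years` and the reference year of 2007 (the year of the lecture) are assumptions, not part of the slides.

```python
REFERENCE_YEAR_AD = 2007  # assumption: "now" is the year of the lecture


def age_in_years(year_bc: int, reference_ad: int = REFERENCE_YEAR_AD) -> int:
    """Convert a calendar year BC to an age in years before the reference year."""
    # There is no year 0 between 1 BC and AD 1, hence the "- 1".
    return year_bc + reference_ad - 1


a, b, c = 150, 300, 450  # calendar years BC from the interval example

# Interval property: the distances between the dates are meaningful and equal.
equal_spacing = (c - b) == (b - a)

# Dividing the calendar labels gives a number with no meaning,
# whereas dividing the ages (measured from a datum) does.
label_ratio = c / a
age_ratio = age_in_years(c) / age_in_years(a)

print(f"equal spacing: {equal_spacing}")
print(f"ratio of labels: {label_ratio:.2f}, ratio of ages: {age_ratio:.2f}")
```

Once the dates are expressed as ages they have a datum, so statements such as "twice as old" become valid.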

Mortlake-style bowl

Fengate-style bowl

Grooved ware jar

Nominal, Ordinal and Interval

Note!

Avoid using 0 or 1 to indicate such variables as yes or no, as we may need to know if it is “no” or “no data”

Also when using presence or absence you may wish to add “missing” to avoid confusion
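One way to keep “absent” and “missing” apart at the recording level is a three-state variable. The sketch below is a minimal illustration; the names `Presence` and `parse_presence`, and the rule that a blank entry means “missing”, are assumptions.

```python
from enum import Enum


class Presence(Enum):
    PRESENT = "present"
    ABSENT = "absent"    # looked for and not found
    MISSING = "missing"  # not recorded, so we cannot say either way


def parse_presence(raw: str) -> Presence:
    """Turn a recorded entry into a Presence value.

    Assumption: a blank entry means the observation was never made.
    Anything else must be one of the three words, or ValueError is raised.
    """
    text = raw.strip().lower()
    if text == "":
        return Presence.MISSING
    return Presence(text)


entries = ["present", "", "Absent", "missing"]
print([parse_presence(entry).value for entry in entries])
```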

Further distinction

Nominal and Ordinal
• = categorical
• = qualitative

Interval and Ratio
• = continuous
• = quantitative

Coding

Nominal and Ordinal often need coding, to minimise errors, via a keyword index

• con = context
• str = stray find
• set = settlement
• bur = burial

Avoid 1, 2, 3, etc., as you will have to keep looking up their meanings, which is time consuming

Coding

NOTE!

EVERY DATA VALUE MUST HAVE A CODE AND ONLY ONE CODE!
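The keyword index above can be held as a simple lookup table. This is a sketch only: the codes come from the slide, but the names `FIND_TYPE_CODES`, `has_unique_meanings` and `decode` are hypothetical.

```python
# Keyword index from the slide: short mnemonic code -> meaning.
FIND_TYPE_CODES = {
    "con": "context",
    "str": "stray find",
    "set": "settlement",
    "bur": "burial",
}


def has_unique_meanings(codes: dict) -> bool:
    """True if no two codes share a meaning, i.e. each value has only one code."""
    meanings = list(codes.values())
    return len(meanings) == len(set(meanings))


def decode(code: str) -> str:
    """Look up a code, rejecting anything that is not in the index."""
    key = code.strip().lower()
    if key not in FIND_TYPE_CODES:
        raise ValueError(f"unknown code: {code!r}")
    return FIND_TYPE_CODES[key]


print(has_unique_meanings(FIND_TYPE_CODES), decode("bur"))
```

Rejecting unknown codes at entry time is one way of making sure every data value ends up with a code from the index.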

Grouping

Good for periods, as in
• Late Bronze (1200-650)
• Early Iron (649-100)
• Late Iron (100+)

NOTE: it is better to record as a continuous variable (e.g. 780BC), then group as an output (e.g. Late Bronze)
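Grouping on output might look like the sketch below, which keeps the recorded date and derives the period from it. Two assumptions are made here: all dates are years BC, and “100+” is read as later than 100BC (so 100BC itself falls in Early Iron). The function name `period_for` is hypothetical.

```python
def period_for(year_bc: int) -> str:
    """Group a recorded date (in years BC) into one of the slide's periods."""
    if 650 <= year_bc <= 1200:
        return "Late Bronze"
    if 100 <= year_bc <= 649:
        return "Early Iron"
    if 0 < year_bc < 100:
        # Assumption: "100+" means later than 100BC.
        return "Late Iron"
    raise ValueError(f"{year_bc}BC is outside the grouped periods")


recorded_dates_bc = [780, 649, 300, 50]  # hypothetical recorded values
for year in recorded_dates_bc:
    print(f"{year}BC -> {period_for(year)}")
```

Because the original dates are kept, the period boundaries can be changed later without re-recording anything.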

Good Practice

Always keep a “CLEAN” version of the original data set

Exploring the data

Context FNO Taxon Bone z1 z2 z3 z4 z5 z6 F/U L/R art. sex NISP chop cut m1 m2 m3 m4
269 58 bs mn 0 0 0 0 0 0 - r - - 1 35.9 14.6
722 191 eq sc 1 1 1 1 1 1 f r 2 - 1 78.2 40.7 55.6
722 191 eq sc 1 1 1 1 1 1 f l 2 - 1 78.7 41.4 48.5
371 102 eq sc 1 1 1 1 1 1 f r - - 1 45.0 58.0 52.9
722 191 eq cal 1 1 1 1 1 0 f r 2 - 1 90.6 45.0
722 191 eq mp 1 1 1 0 0 0 f l 2 - 1 41 45.6 40.3 28.7
722 191 eq mp 1 1 1 0 0 0 f r 2 - 1 42 46.0 39.5 29.4
722 191 eq mp 1 1 1 0 0 0 f r 2 - 1 46.0 39.7 28.5
285 72 bs cal 1 1 1 1 1 0 f r - - 1 1 1 137.5 46.3
722 191 eq mp 1 1 1 0 0 0 f l 2 - 1 42 46.3 40.0 29.2
722 191 eq pp 1 1 1 0 0 0 f l 2 - 1 71 48.7 45.0 32.5
722 191 eq pp 1 1 1 0 0 0 f r 2 - 1 71 48.8 45.2 32.5
722 191 eq pp 1 1 1 0 0 0 f r 2 - 1 68 49.0 45.0 34.1
722 191 eq pel 1 1 1 1 1 1 f l 2 - 1 60.1 52.2
722 191 eq ast 1 1 1 1 0 0 - r 2 - 1 51 53 44.9
722 191 eq ast 1 1 1 1 0 0 - l 2 - 1 51 54 44.4 52.7
722 191 eq mciii 1 1 1 1 1 1 f r 2 - 1 187 179 43.7 28.6
722 191 eq mciii 1 1 1 1 1 1 f l 2 - 1 187 180 42.8
722 191 eq mtiii 1 1 1 1 1 1 f l 2 - 1 229 223 41.4 39.1
722 191 eq mtiii 1 1 1 1 1 1 f r 2 - 1 229 223 42.8 39.5
722 191 eq hum 1 1 1 1 1 1 f/f r 2 - 1 232 30.8
722 191 eq rad 1 1 1 1 1 1 f/f l 2 - 1 274 71.7 64.2

example data set

univariate frequency table

species   frequency
cattle    187
sheep     109
pig        78
horse      21
Total     395
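A univariate frequency table is simply a count of how often each category occurs. The sketch below builds one with Python's `collections.Counter`; the short list of species records is hypothetical and is not the data behind the table above.

```python
from collections import Counter

# Hypothetical records, one species name per identified bone.
records = ["cattle", "sheep", "cattle", "pig", "horse", "cattle", "sheep"]

frequency = Counter(records)

# Print the table, most frequent species first, then the total.
for species, count in frequency.most_common():
    print(f"{species:<8}{count:>4}")
print(f"{'Total':<8}{sum(frequency.values()):>4}")
```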

bivariate frequency table

species   pits   ditches   Total
cattle      67       120     187
sheep       63        46     109
pig         41        37      78
horse        3        18      21
Total      174       221     395

bivariate frequency table (with column percentages)

species   pits          ditches       Total
cattle     67 (39%)     120 (54%)     187
sheep      63 (36%)      46 (21%)     109
pig        41 (24%)      37 (17%)      78
horse       3 (2%)       18 (8%)       21
Total     174 (100%)    221 (100%)    395
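The percentages in this table are column percentages: each count divided by the total for its feature type. The sketch below recomputes them from the counts on the slide; the function name `column_percentages` is hypothetical.

```python
# Counts from the bivariate table: species -> feature type -> frequency.
counts = {
    "cattle": {"pits": 67, "ditches": 120},
    "sheep": {"pits": 63, "ditches": 46},
    "pig": {"pits": 41, "ditches": 37},
    "horse": {"pits": 3, "ditches": 18},
}


def column_percentages(table: dict) -> dict:
    """Express every count as a whole-number percentage of its column total."""
    totals = {}
    for row in table.values():
        for column, count in row.items():
            totals[column] = totals.get(column, 0) + count
    return {
        species: {column: round(100 * count / totals[column])
                  for column, count in row.items()}
        for species, row in table.items()
    }


percentages = column_percentages(counts)
for species, row in percentages.items():
    print(f"{species:<8}pits {row['pits']:>3}%   ditches {row['ditches']:>3}%")
```

Percentages make the two columns comparable even though more bones came from ditches (221) than from pits (174).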

Multivariate

These tend to operate on a table, or matrix of items, described in terms of a set of variables

Pictorial displays for categorical data

[Figure: bar chart of % by species (cattle, sheep, pig, horse)]

[Figure: multiple bar chart of % by species (cattle, sheep, pig, horse) for pits and ditches]

[Figure: pie chart]

Pictorial displays for continuous data

[Figure: histograms of count against Bd (mm), 49.0 to 72.0, for Hunt's House and Monkton]

Basic descriptive statistics:

• mode
• median
• mean
• range
• variance
• standard deviation

pottery fragments (weights in grams): 2, 2, 3, 5, 8

Mode = 2

Mode

Mode is the only way to measure average/typical in the Nominal class

If there are two modes then the data are bimodal (1, 2, 3, 3, 6, 6, 7, 8, 9)

Three = trimodal, etc.
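The mode can be found for any level of measurement, including nominal codes. The sketch below uses `statistics.multimode` from the Python standard library (Python 3.8 or later), which returns every value tied for most frequent; the list of find-type codes is hypothetical.

```python
from statistics import multimode

weights = [2, 2, 3, 5, 8]                # pottery weights from the slide
bimodal = [1, 2, 3, 3, 6, 6, 7, 8, 9]    # bimodal example from the slide
find_types = ["con", "bur", "con", "str"]  # hypothetical nominal data

modes_weights = multimode(weights)
modes_bimodal = multimode(bimodal)
modes_find_types = multimode(find_types)

print(modes_weights, modes_bimodal, modes_find_types)
```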

pottery fragments (weights in grams): 2, 2, 3, 5, 8

Mode = 2

Median = 3

Median

Best for ordinal and above

If the number of values is even, take the value midway between the two middle numbers

(1, 2, 3, 4, 5, 6, 7, 8: (4+5)/2 = 4.5)
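Both cases, odd and even, can be checked with the standard library. For ordinal data the midpoint between two categories may not itself be a category, so the sketch also shows `statistics.median_low`, which returns the lower of the two middle values; applying it to ordinal ranks is a suggestion here, not something the slides prescribe, and the condition ranks are hypothetical.

```python
from statistics import median, median_low

weights = [2, 2, 3, 5, 8]                 # odd number of values
even_example = [1, 2, 3, 4, 5, 6, 7, 8]   # even number of values

median_weights = median(weights)          # the middle value
median_even = median(even_example)        # midway between 4 and 5

# Hypothetical condition ranks (1 = Excellent ... 4 = Poor), even count.
condition_ranks = [1, 2, 2, 3, 4, 4]
median_condition = median_low(condition_ranks)  # stays on a real category

print(median_weights, median_even, median_condition)
```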

pottery fragments (weights in grams): 2, 2, 3, 5, 8

Mode = 2

Median = 3

Mean = (2+2+3+5+8)/5 = 4

Mean

The most commonly used average; it will only work for interval and ratio

It is the most important measure of position because a lot of further statistical analyses are based on it
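The rules stated so far (mode from nominal upwards, median from ordinal upwards, mean for interval and ratio only) can be written as a small lookup. This is an illustrative sketch; the names `LEVELS`, `MINIMUM_LEVEL` and `allowed_averages` are hypothetical.

```python
# Levels of measurement, lowest to highest.
LEVELS = ["nominal", "ordinal", "interval", "ratio"]

# The lowest level at which each average is meaningful, as stated in the slides.
MINIMUM_LEVEL = {"mode": "nominal", "median": "ordinal", "mean": "interval"}


def allowed_averages(level: str) -> list:
    """List the averages that are meaningful for a variable at this level."""
    rank = LEVELS.index(level)  # raises ValueError for an unknown level
    return [name for name, minimum in MINIMUM_LEVEL.items()
            if LEVELS.index(minimum) <= rank]


for level in LEVELS:
    print(f"{level:<9}{', '.join(allowed_averages(level))}")
```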

Conclusion

It is important to understand that the mode, median and mean are three quite different measures of position which can give three different values when applied to the same data-set

          2, 2, 3, 5, 8    2, 2, 3, 5, 6, 8
Mode      2                2
Median    3                4
Mean      4                4.333
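The comparison above can be reproduced in a few lines. The helper name `position_summary` is hypothetical; the functions it calls are from the Python standard library.

```python
from statistics import mean, median, mode


def position_summary(values: list) -> dict:
    """Return the three measures of position for one data set."""
    return {
        "mode": mode(values),
        "median": median(values),
        "mean": mean(values),
    }


first = position_summary([2, 2, 3, 5, 8])
second = position_summary([2, 2, 3, 5, 6, 8])

for name in ("mode", "median", "mean"):
    print(f"{name:<8}{first[name]:<8.3f}{second[name]:<8.3f}")
```

Adding a single value (6) leaves the mode unchanged but moves both the median and the mean.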

The skew

[Figure: three distributions labelled symmetrical, positive skew and negative skew]

Measures of variability: the spread

pottery fragments (weights in grams): 2, 2, 3, 5, 8

Range = max - min = 8 - 2 = 6

• Very simple and of limited use
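One reason the range is of limited use is that it depends only on the two extreme values. The sketch below shows this with the pottery weights and a second, hypothetical batch in which a single fragment is much heavier; the function name `value_range` is also hypothetical.

```python
def value_range(values: list) -> float:
    """Range = largest value minus smallest value."""
    return max(values) - min(values)


weights = [2, 2, 3, 5, 8]        # pottery weights from the slide
with_outlier = [2, 2, 3, 5, 80]  # hypothetical: one unusually heavy fragment

print(value_range(weights), value_range(with_outlier))
```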

variance (s²)

pottery fragments (weights in grams): 2, 2, 3, 5, 8

(Mean = (2+2+3+5+8)/5 = 4)

$$s^2 = \frac{(2-4)^2 + (2-4)^2 + (3-4)^2 + (5-4)^2 + (8-4)^2}{5} = \frac{26}{5} = 5.2$$

standard deviation

pottery fragments (weights in grams): 2, 2, 3, 5, 8

variance (s²) = 5.2

$$\text{standard deviation} = \sqrt{\text{variance}} = \sqrt{5.2} \approx 2.28$$
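The variance and standard deviation worked out above can be checked step by step. The sketch follows the slides in dividing by the number of values (5), which matches `statistics.pvariance` and `statistics.pstdev`; the sample versions, which divide by one less than the number of values, give a different answer and are shown only for comparison.

```python
from math import sqrt
from statistics import pstdev, pvariance, variance

weights = [2, 2, 3, 5, 8]
n = len(weights)

mean_weight = sum(weights) / n
squared_deviations = [(x - mean_weight) ** 2 for x in weights]

s2 = sum(squared_deviations) / n  # divisor n, as on the slide
s = sqrt(s2)

# The standard library agrees with the manual calculation.
assert abs(s2 - pvariance(weights)) < 1e-12
assert abs(s - pstdev(weights)) < 1e-12

# For comparison only: the sample variance divides by n - 1 instead.
sample_s2 = variance(weights)

print(f"variance = {s2:.1f}, standard deviation = {s:.2f}, "
      f"sample variance = {sample_s2:.1f}")
```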

Summary

Variables are measured according to one of FOUR levels

1. Nominal = arbitrary name

2. Ordinal = sequence with no distance

3. Interval = sequence with fixed distance

4. Ratio = sequence with a fixed datum

Summary

Measures of position (average/typical)
• Mode
• Median
• Mean

Measures of variability (the spread)
• Range
• Variance
• Standard Deviation