Simple Statistics for Corpus Linguistics
Sean Wallis
Survey of English Usage
University College London

Outline
• Numbers…
• A simple research question
– do women speak or write more than men in ICE-GB?
– p = proportion = probability
• Another research question
– what happens to speakers’ use of modal shall vs. will over time?
– the idea of inferential statistics
– plotting confidence intervals
• Concluding remarks

Numbers...
• We are used to concepts like these being expressed as numbers:
– length (distance, height)
– area
– volume
– temperature
– wealth (income, assets)
• We are going to discuss another concept:
– probability (proportion, percentage)
– a simple idea, at the heart of statistics

Probability
• Based on another, even simpler, idea:
– probability p = x / n
– e.g. the probability that the speaker says will instead of shall
• where
– frequency x (often written f) is the number of times something actually happens, i.e. the number of hits in a search (here: cases of will)
– baseline n is the number of times something could happen, i.e. the number of hits in a more general search or in several alternative patterns (‘alternate forms’) (here: total of will + shall)
• Probability can range from 0 to 1
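As a minimal sketch of p = x / n in Python, the snippet below uses the LLC counts quoted later in these slides (78 cases of will against 110 of shall). The function name `proportion` is a hypothetical helper chosen for illustration; it is not part of ICECUP.

```python
def proportion(x: int, n: int) -> float:
    """Return p = x / n, the observed probability of an event."""
    if n <= 0:
        raise ValueError("baseline n must be positive")
    if not 0 <= x <= n:
        raise ValueError("frequency x must lie between 0 and n")
    return x / n

# Counts for the LLC texts, taken from the shall/will table later in the slides.
will_hits = 78         # x: cases of will
baseline = 78 + 110    # n: total of will + shall

p_will = proportion(will_hits, baseline)
print(f"p(will | {{will, shall}}) = {p_will:.3f}")
```

Because x can never exceed n, the result always falls between 0 and 1.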

What can a corpus tell us?
• A corpus is a source of knowledge about language:
– corpus
– introspection/observation/elicitation
– controlled laboratory experiment
– computer simulation
How do these differ in what they might tell us?
• A corpus is a sample of language, varying by:
– source (e.g. speech vs. writing, age...)
– levels of annotation (e.g. parsing)
– size (number of words)
– sampling method (random sample?)
How does this affect the types of knowledge we might obtain?

What can a parsed corpus tell us?
• Three kinds of evidence may be found in a parsed corpus:
– Frequency evidence of a particular known rule, structure or linguistic event (How often?)
– Factual evidence of new rules, etc. (How novel?)
– Interaction evidence of relationships between rules, structures and events (Does X affect Y?)
• Lexical searches may also be made more precise using the grammatical analysis

A simple research question
• Let us consider the following question:
• Do women speak or write more words than men in the ICE-GB corpus?
• What do you think?
• How might we find out?

Let's get some data
• Open ICE-GB with ICECUP
– Text Fragment query for words:
• “*+<{~PUNC,~PAUSE}>”
• counts every word, excluding pauses and punctuation
– Variable query:
• TEXT CATEGORY = spoken, written
– Variable query:
• SPEAKER GENDER = f, m, <unknown>
Combine these 3 queries:

| | F | M | <unknown> | TOTAL |
|---|---|---|---|---|
| TOTAL | 275,999 | 667,934 | 93,355 | 1,037,288 |
| spoken | 174,499 | 439,741 | 1,076 | 615,316 |
| written | 101,500 | 228,193 | 92,279 | 421,972 |

ICE-GB: gender / written-spoken
• Proportion of words in each category spoken/written by women and men
– The authors of some texts are unspecified
– Some written material may be jointly authored
– female/male ratio varies slightly
[Figure: proportion p (axis 0 to 1) of female and male words for TOTAL, spoken and written]
p (female) = words spoken by women / total words (excluding <unknown>)
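The calculation of p (female) can be sketched in a few lines of Python. The counts are the ICE-GB word totals given in the table earlier in the slides; the dictionary layout and variable names are illustrative assumptions only.

```python
# Word counts from the ICE-GB table: (female, male) per category.
# Words of unknown gender are left out of the baseline.
counts = {
    "TOTAL": (275_999, 667_934),
    "spoken": (174_499, 439_741),
    "written": (101_500, 228_193),
}

p_female = {}
for category, (female, male) in counts.items():
    known = female + male                  # baseline n: words of known gender
    p_female[category] = female / known    # p = x / n
    print(f"{category:8s} p(female) = {p_female[category]:.3f}")
```

On these figures women contribute a little under a third of the words in each category, with a slightly higher share in writing than in speech.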

p = Probability = Proportion
• We asked ourselves the following question:
– Do women speak or write more words than men in the ICE-GB corpus?
– To answer this we looked at the proportion of words in ICE-GB that are produced by women (out of all words where the gender is known)
• The proportion of words produced by women can also be thought of as a probability:
– What is the probability that, if we were to pick any random word in ICE-GB (and the gender was known), it would be uttered by a woman?

Another research question
• Let us consider the following question:
• What happens to modal shall vs. will over time in British English?– Does shall increase or decrease?
• What do you think?
• How might we find out?

Let's get some data
• Open DCPSE with ICECUP
– FTF query for first person declarative shall:
• repeat for will
– Corpus Map:
• DATE
Do the first set of queries and then drop into Corpus Map

Modal shall vs. will over time
• Plotting probability of speaker selecting modal shall out of shall/will over time (DCPSE)
[Figure: p(shall | {shall, will}), from 0.0 (shall = 0%) to 1.0 (shall = 100%), plotted by year from 1955 to 1995 (Aarts et al. 2013)]
Is shall going up or down?

Is shall going up or down?
• Whenever we look at change, we must ask ourselves two things:
– What is the change relative to?
• Is our observation higher or lower than we might expect?
• In this case we ask: does shall decrease relative to shall + will?
– How confident are we in our results?
• Is the change big enough to be reproducible?

The idea of a confidence interval
• All observations are imprecise
– Randomness is a fact of life
– Our abilities are finite:
• to measure accurately or
• reliably classify into types
• We need to express caution in citing numbers
• Example (from Levin 2013):
– 77.27% of uses of think in 1920s data have a literal (‘cogitate’) meaning
Really? Not 77.28, or 77.26?
– 77% of uses of think in 1920s data have a literal (‘cogitate’) meaning
Sounds defensible. But how confident can we be in this number?
– 77% (66-86%*) of uses of think in 1920s data have a literal (‘cogitate’) meaning
Finally we have a credible range of values - needs a footnote* to explain how it was calculated.

The ‘sample’ and the ‘population’
• We said that the corpus was a sample
• Previously, we asked about the proportions of male/female words in the corpus (ICE-GB)
– We asked questions about the sample
– The answers were statements of fact
• Now we are asking about “British English”
– We want to draw an inference
• from the sample (in this case, DCPSE)
• to the population (similarly-sampled BrE utterances)
– This inference is a best guess
– This process is called inferential statistics

Basic inferential statistics
• Suppose we carry out an experiment
– We toss a coin 10 times and get 5 heads
– How confident are we in the results?
• Suppose we repeat the experiment
• Will we get the same result again?
• Let’s try…
– You should have one coin
– Toss it 10 times
– Write down how many heads you get
– Do you all get the same results?
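For readers without a room full of coins, the classroom exercise can be imitated in Python. This is only a sketch: the group size of 20, the seed and the helper name `count_heads` are arbitrary assumptions, and a pseudo-random generator stands in for real coins.

```python
import random

def count_heads(tosses: int, rng: random.Random) -> int:
    """Toss a fair coin `tosses` times and count the heads."""
    return sum(rng.random() < 0.5 for _ in range(tosses))

rng = random.Random(1)  # fixed (arbitrary) seed so the run is repeatable

# Twenty imaginary participants, each tossing a coin 10 times.
results = [count_heads(10, rng) for _ in range(20)]

print("heads per participant:", results)
print("distinct outcomes:", sorted(set(results)))
```

Changing the seed changes who gets what, which is the point of the exercise: repeating the same experiment does not have to give the same answer.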

The Binomial distribution
• Repeated sampling tends to form a Binomial distribution around the expected mean X
• We toss a coin 10 times, and get 5 heads
• Due to chance, some samples will have a higher or lower score
[Figure: frequency F against number of heads x (axis marked 1, 3, 5, 7, 9), built up as the number of samples N grows (N = 1, 4, 8, 12, 16, 20, 26), around the expected mean X]
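The shape that repeated sampling tends towards can be computed directly. The sketch below assumes a fair coin (P = 0.5) and n = 10 tosses, as in the coin example, and uses the standard Binomial formula; the function name `binomial_pmf` is an illustrative choice.

```python
from math import comb

def binomial_pmf(x: int, n: int, P: float) -> float:
    """Probability of exactly x heads in n tosses when P(head) = P."""
    return comb(n, x) * P**x * (1 - P) ** (n - x)

n, P = 10, 0.5  # 10 tosses of a fair coin
pmf = [binomial_pmf(x, n, P) for x in range(n + 1)]

# Crude text histogram: one '#' per percentage point of probability.
for x, prob in enumerate(pmf):
    print(f"{x:2d} heads: {prob:.4f} {'#' * round(prob * 100)}")
```

Five heads is the single most likely outcome, but it accounts for only about a quarter of all samples, so scores of 4 or 6 are far from surprising.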

The Binomial distribution
• It is helpful to express x as the probability of choosing a head, p, with expected mean P
• p = x / n
– n = max. number of possible heads (10)
• Probabilities are in the range 0 to 1 = percentages (0 to 100%)
[Figure: frequency F against proportion p (axis marked 0.1, 0.3, 0.5, 0.7, 0.9), around the expected mean P]

The Binomial distribution
• Take-home point:
– A single observation, say x hits (or p as a proportion of n possible hits) in the corpus, is not guaranteed to be correct ‘in the world’!
• Estimating the confidence you have in your results is essential
– We want to make predictions about future runs of the same experiment
[Figure: frequency F against proportion p, with a single observation p marked alongside the expected mean P]

Binomial → Normal
• The Binomial (discrete) distribution is close to the Normal (continuous) distribution
[Figure: frequency F against x, axis marked 0.1 to 0.9]

The central limit theorem
• Any Normal distribution can be defined by only two variables and the Normal function z
– population mean P
– standard deviation $S = \sqrt{P(1 - P) / n}$
– With more data in the experiment, S will be smaller
• 95% of the curve is within ~2 standard deviations of the expected mean
– the correct figure is 1.95996! This is the critical value of z for an error level of 0.05.
[Figure: Normal curve of F against p, centred on the population mean P, with 95% of the area within z . S either side of P and 2.5% in each tail]
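Both quantities on this slide can be checked numerically. The sketch below assumes the standard Binomial formula S = sqrt(P(1 - P) / n) and takes the critical value of z from Python's standard-library `statistics.NormalDist`; the sample sizes 10 and 100 are arbitrary illustrations.

```python
from math import sqrt
from statistics import NormalDist

def standard_deviation(P: float, n: int) -> float:
    """Standard deviation of a proportion: S = sqrt(P(1 - P) / n)."""
    return sqrt(P * (1 - P) / n)

# Two-tailed critical value of z for an error level of 0.05:
# 2.5% of the Normal curve lies in each tail.
z = NormalDist().inv_cdf(1 - 0.05 / 2)
print(f"z = {z:.5f}")

# More data in the experiment gives a smaller S.
S_10 = standard_deviation(0.5, 10)
S_100 = standard_deviation(0.5, 100)
print(f"S (n = 10)  = {S_10:.4f}")
print(f"S (n = 100) = {S_100:.4f}")
```

Multiplying the sample size by 100 divides S by 10, because n sits under the square root.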

The single-sample z test...
• Is an observation p more than z standard deviations from the expected (population) mean P?
• If yes, p is significantly different from P
[Figure: Normal curve of F against p, centred on P, with 2.5% tails beyond z . S either side; the observation p is marked]
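The test can be written as a single comparison. The following is a sketch under stated assumptions: a two-tailed test at an error level of 0.05, S computed as sqrt(P(1 - P) / n), and a hypothetical function name `z_test`. The examples return to the coin, with P = 0.5 and n = 10.

```python
from math import sqrt
from statistics import NormalDist

def z_test(p: float, P: float, n: int, alpha: float = 0.05) -> bool:
    """Return True if p lies more than z standard deviations from P."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # critical value, about 1.96
    S = sqrt(P * (1 - P) / n)                # standard deviation about P
    return abs(p - P) > z * S

# 7 heads out of 10 is within z . S of 0.5; 9 heads out of 10 is not.
seven_heads = z_test(0.7, 0.5, 10)
nine_heads = z_test(0.9, 0.5, 10)
print("7 heads significantly different from fair?", seven_heads)
print("9 heads significantly different from fair?", nine_heads)
```

With only 10 tosses, z . S is about 0.31, so an observation has to stray a long way from 0.5 before the test calls it significant.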

...gives us a “confidence interval”
• P ± z . S is the confidence interval for P
– We want to plot the interval about p
[Figure: Normal curve of F against p, centred on P, with 2.5% tails; the 95% interval about the observation p runs from w– to w+]

...gives us a “confidence interval”
• The interval about p is called the Wilson score interval
• This interval reflects the Normal interval about P:
• If P is at the upper limit of p, p is at the lower limit of P
(Wallis, 2013)
[Figure: F against p, showing the interval from w– to w+ about the observation p, and the Normal interval about P with 2.5% tails]
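The slides name the Wilson score interval without giving its formula. The sketch below uses the standard textbook form of the Wilson interval, without continuity correction, which is an assumption about the variant intended here. The counts are the shall figures from the LLC and ICE-GB table on the following slides (110 of 188 and 40 of 98); `wilson_interval` is a hypothetical function name.

```python
from math import sqrt
from statistics import NormalDist

def wilson_interval(p: float, n: int, alpha: float = 0.05) -> tuple[float, float]:
    """Wilson score interval (w-, w+) about an observed proportion p."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    centre = p + z * z / (2 * n)
    spread = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (centre - spread) / denom, (centre + spread) / denom

# p(shall | {shall, will}) in each subcorpus, with 95% Wilson intervals.
llc = wilson_interval(110 / 188, 188)
ice = wilson_interval(40 / 98, 98)
print(f"LLC:    p = {110 / 188:.3f}, interval = ({llc[0]:.3f}, {llc[1]:.3f})")
print(f"ICE-GB: p = {40 / 98:.3f}, interval = ({ice[0]:.3f}, {ice[1]:.3f})")
print("intervals overlap:", llc[0] <= ice[1])
```

Unlike a simple p ± z . S interval, the Wilson interval is asymmetric and never leaves the range 0 to 1, even when p is 0 or 1.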
![Page 69: Simple Statistics for Corpus Linguistics](https://reader035.fdocuments.in/reader035/viewer/2022081504/5681447e550346895db11512/html5/thumbnails/69.jpg)
Modal Modal shallshall vs. vs. willwill over time over time
• Simple test: – Compare p for
• all LLC texts in DCPSE (1956-77) with• all ICE-GB texts (early 1990s)
– We get the following data
– We may plot the probabilityof shall being selected,with Wilson intervals
LLC ICE-GB totalshall 110 40 150will 78 58 136total 188 98 286
0.0
0.2
0.4
0.6
0.8
1.0
LLC ICE-GB
p(shall | {shall, will})
![Page 70: Simple Statistics for Corpus Linguistics](https://reader035.fdocuments.in/reader035/viewer/2022081504/5681447e550346895db11512/html5/thumbnails/70.jpg)
Modal shall vs. will over time
• Simple test:
– Compare p for
• all LLC texts in DCPSE (1956-77) with
• all ICE-GB texts (early 1990s)
– We get the following data

        LLC   ICE-GB   total
shall   110     40      150
will     78     58      136
total   188     98      286

May be input in a 2 x 2 chi-square test, or you can check Wilson intervals
– We may plot the probability of shall being selected, with Wilson intervals
[Figure: p(shall | {shall, will}) for LLC and ICE-GB, y axis from 0.0 to 1.0]
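A minimal sketch of the 2 x 2 chi-square calculation for this table follows. It assumes the ordinary Pearson statistic without a continuity correction and compares it with 3.841, the critical value for one degree of freedom at the 0.05 level; the function name is hypothetical.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square for the table [[a, b], [c, d]], no continuity correction."""
    observed = [[a, b], [c, d]]
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    n = a + b + c + d
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Expected frequency if row and column were independent.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2


CRITICAL_05_DF1 = 3.841  # chi-square critical value, 1 degree of freedom, alpha = 0.05

# Rows: shall, will. Columns: LLC, ICE-GB.
chi2 = chi_square_2x2(110, 40, 78, 58)
significant = chi2 > CRITICAL_05_DF1
print(f"chi-square = {chi2:.3f}, significant at 0.05: {significant}")
```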
![Page 71: Simple Statistics for Corpus Linguistics](https://reader035.fdocuments.in/reader035/viewer/2022081504/5681447e550346895db11512/html5/thumbnails/71.jpg)
Modal shall vs. will over time
• Plotting modal shall/will over time (DCPSE)
• Small amounts of data / year
[Figure: p(shall | {shall, will}) by year, x axis 1955 to 1995, y axis 0.0 to 1.0]
![Page 72: Simple Statistics for Corpus Linguistics](https://reader035.fdocuments.in/reader035/viewer/2022081504/5681447e550346895db11512/html5/thumbnails/72.jpg)
Modal shall vs. will over time
• Plotting modal shall/will over time (DCPSE)
[Figure: p(shall | {shall, will}) by year, x axis 1955 to 1995, y axis 0.0 to 1.0]
• Small amounts of data / year
• Confidence intervals identify the degree of certainty in our results
![Page 73: Simple Statistics for Corpus Linguistics](https://reader035.fdocuments.in/reader035/viewer/2022081504/5681447e550346895db11512/html5/thumbnails/73.jpg)
Modal shall vs. will over time
• Plotting modal shall/will over time (DCPSE)
[Figure: p(shall | {shall, will}) by year, x axis 1955 to 1995, y axis 0.0 to 1.0]
• Small amounts of data / year
• Confidence intervals identify the degree of certainty in our results
• Highly skewed p in some cases
– p = 0 or 1 (circled)
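The extreme cases are where the choice of interval matters most. The sketch below compares the Wilson interval with a naive interval p ± z . S computed about p itself, for p = 0 and p = 1. The sample size n = 5 is a hypothetical stand-in for a year with very little data, and `naive_interval` is a helper introduced here only for comparison.

```python
import math

Z_95 = 1.959964  # two-tailed critical value of the Normal distribution at 95%


def wilson_interval(p, n, z=Z_95):
    """Wilson score interval (w-, w+) about an observed proportion p."""
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


def naive_interval(p, n, z=Z_95):
    """p ± z.S with S estimated from p itself (collapses when p is 0 or 1)."""
    s = math.sqrt(p * (1 - p) / n)
    return p - z * s, p + z * s


n = 5  # hypothetical small sample for one year
wilson_at_0 = wilson_interval(0.0, n)
wilson_at_1 = wilson_interval(1.0, n)
naive_at_0 = naive_interval(0.0, n)

print(f"p = 0: naive = {naive_at_0}, Wilson = ({wilson_at_0[0]:.4f}, {wilson_at_0[1]:.4f})")
print(f"p = 1: Wilson = ({wilson_at_1[0]:.4f}, {wilson_at_1[1]:.4f})")
```

The naive interval has zero width at p = 0, which would wrongly suggest complete certainty, whereas the Wilson interval stays inside [0, 1] and remains wide for small n.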
![Page 74: Simple Statistics for Corpus Linguistics](https://reader035.fdocuments.in/reader035/viewer/2022081504/5681447e550346895db11512/html5/thumbnails/74.jpg)
Modal shall vs. will over time
• Plotting modal shall/will over time (DCPSE)
[Figure: p(shall | {shall, will}) by year, x axis 1955 to 1995, y axis 0.0 to 1.0]
• Small amounts of data / year
• Confidence intervals identify the degree of certainty in our results
• We can now estimate an approximate downwards curve
(Aarts et al. 2013)
![Page 75: Simple Statistics for Corpus Linguistics](https://reader035.fdocuments.in/reader035/viewer/2022081504/5681447e550346895db11512/html5/thumbnails/75.jpg)
Recap
• Whenever we look at change, we must ask ourselves two things:
What is the change relative to?
– Is our observation higher or lower than we might expect?
• In this case we ask
• Does shall decrease relative to shall + will?
How confident are we in our results?
– Is the change big enough to be reproducible?
![Page 76: Simple Statistics for Corpus Linguistics](https://reader035.fdocuments.in/reader035/viewer/2022081504/5681447e550346895db11512/html5/thumbnails/76.jpg)
Conclusions
• An observation is not the actual value
– Repeating the experiment might get different results
• The basic idea of these methods is
– Predict range of future results if experiment was repeated
• ‘Significant’ = effect > 0 (e.g. 19 times out of 20)
• Based on the Binomial distribution
– Approximated by Normal distribution – many uses
• Plotting confidence intervals
• Use goodness of fit or single-sample z tests to compare an observation with an expected baseline
• Use 2 x 2 chi-square tests or two independent sample z tests to compare two observed samples
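As an illustration of the last point, the sketch below applies a two independent sample z test to the shall/will counts from the earlier table (110 out of 188 in LLC, 40 out of 98 in ICE-GB). It assumes the common form of the test that uses a pooled probability estimate for the standard deviation; the function name is hypothetical.

```python
import math

Z_95 = 1.959964  # two-tailed critical value of the Normal distribution at 95%


def two_sample_z(x1, n1, x2, n2):
    """z score for the difference between two independent proportions (pooled estimate)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se


z = two_sample_z(110, 188, 40, 98)
print(f"z = {z:.3f}, significant at 0.05: {abs(z) > Z_95}")
print(f"z squared = {z * z:.3f}")
```

With a pooled estimate, the square of z equals the 2 x 2 chi-square statistic computed without a continuity correction, which is why the two tests reach the same conclusion.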
![Page 77: Simple Statistics for Corpus Linguistics](https://reader035.fdocuments.in/reader035/viewer/2022081504/5681447e550346895db11512/html5/thumbnails/77.jpg)
References
• Aarts, B., J. Close, G. Leech and S.A. Wallis (eds). The Verb Phrase in English: Investigating recent language change with corpora. Cambridge: CUP.
– Aarts, B., Close, J., and Wallis, S.A. 2013. Choices over time: methodological issues in investigating current change. Chapter 2.
– Levin, M. 2013. The progressive in modern American English. Chapter 8.
• Wallis, S.A. 2013. Binomial confidence intervals and contingency tests. Journal of Quantitative Linguistics 20:3, 178-208.
• Wilson, E.B. 1927. Probable inference, the law of succession, and statistical inference. Journal of the American Statistical Association 22: 209-212.
• NOTE: Statistics papers, more explanation, spreadsheets etc. are published on corp.ling.stats blog: http://corplingstats.wordpress.com