Statistika za filologe (Statistics for Philologists)
Transcript, retrieved 8/21/2019
Statistics for Linguists: A Tutorial
Mark Dras
Centre for Language Technology
Macquarie University
HCSNet Summerfest
28 November 2006
Tutorial Structure
structured primarily around examples of tasks linguists might be interested in
within these, statistical ideas that are useful:
  hypothesis testing
  various statistical measures (χ², likelihood ratios, ...)
  statistical distributions
  some other useful ideas (e.g. Latent Semantic Analysis)
basic material taken from Manning and Schütze [1999]
another useful overview: Krenn and Samuelsson [1997]
Collocations
Outline
1 Collocations
Frequency
Hypothesis Testing (+ background: Basic Probability Theory)
The t-Test
Pearson's Chi-Square Test (+ background: Distributions)
Likelihood Ratio Test (+ background: Conditional Probability)
Fisher's Exact Test
2 Verb Subcategorisation
Precision and Recall
3 Semantic Similarity
Latent Semantic Indexing
4 Register Analysis
5 References
Collocations
Definitions
collocation: an expression consisting of two or more words that
correspond to some conventional way of saying things
Firth (1957): Collocations of a given word are statements of the
habitual or customary places of that word
examples:
  noun phrases: strong tea, weapons of mass destruction
  phrasal verbs: to make up
  other standard phrases: the rich and powerful
has limited compositionality
  a broader notion than an idiom; example: international best practice
Collocations Frequency
Frequency
most basic idea: start with a corpus, and count the relevant
frequencies
if looking for two-word collocations, just count frequencies of pairs of adjacent words
obvious problem: we get lots of useless high-frequency words
from the New York Times:

  C(w1 w2)   w1     w2
  80871      of     the
  58841      in     the
  26430      to     the
  21842      on     the
  ...        ...    ...
  12622      from   the
  11428      New    York
  10007      he     said
  ...        ...    ...
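As a sketch (not from the slides), the basic counting step takes a few lines of Python; the corpus here is a made-up toy sentence, not the New York Times data:

```python
from collections import Counter

# toy stand-in corpus (hypothetical; the slide's counts come from the NYT)
corpus = "the rate of the growth of the economy slowed in the last year".split()

# count frequencies of pairs of adjacent words
bigram_counts = Counter(zip(corpus, corpus[1:]))

# the most frequent pair in this toy text is the uninteresting ("of", "the")
top_pair, top_count = bigram_counts.most_common(1)[0]
```

Even on a toy text, the top pair is a high-frequency function-word combination, which is exactly the problem the slide points out.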
Collocations Frequency
Frequency
the basic frequency idea may still be OK, under conditions:
  1. use it when looking to verify specific alternatives or patterns; or
  2. add a filter
Collocations Frequency
Example #1a: Eggcorns
described in Language Log
http://itre.cis.upenn.edu/~myl/languagelog/
idea: something like a mistaken but plausible reanalysis of a word
or phrase
examples: to step foot in, baited breath, free reign, hone in, ripe with mistakes, for all intensive purposes, manner from Heaven, give up the goat
like a folk etymology, a malapropism, or a mondegreen, but not quite any of these
collection by Chris Waigl
http://www.eggcorns.lascribe.net
interested in seeing whether eggcorn is gaining currency
compare by Google hits, e.g. inclement weather (173K whG) vs inclimate weather (11K whG) vs incliment weather (719 whG)
Collocations Frequency
Example #1b: Snowclones
also defined at Language Log
idea: adaptable cliche frames
examples: Have X, will travel; X is the new Y; X, we have a problem
again, use Google hits
Collocations Frequency
Aside: Google Counts
there's been discussion about the reliability of Google-derived frequencies
see Jean Véronis's blog and Language Log
example of the problem:
  Google query: junco partner lyrics (9440 whG)
  Google query: junco partner lyrics connick (279 whG)
  Google query: junco partner lyrics -connick (930 whG)
frequency counts over 100K are generally regarded as unreliable, but this may also be the case for smaller counts
the problems appear to be related to Google's indexing and its treatment of near-identical page matches
Collocations Frequency
Adding Filters
alternatively, if the problem is to find rather than to verify, we can use filters based on part of speech
for instance, in the previous example of extracting collocations, we can use patterns like the following:

  Tag pattern   Example
  A N           linear function
  N N           regression coefficients
  A A N         Gaussian random variable
  N P N         degrees of freedom
  ...           ...
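A minimal sketch of such a filter (the tagged candidates below are hypothetical, and only a subset of the slide's patterns is listed):

```python
# allowed part-of-speech tag patterns, as in the slide's table (subset)
PATTERNS = {("A", "N"), ("N", "N"), ("N", "P", "N")}

# hypothetical (word, tag) sequences for candidate n-grams
candidates = [
    [("linear", "A"), ("function", "N")],
    [("degrees", "N"), ("of", "P"), ("freedom", "N")],
    [("of", "P"), ("the", "Det")],       # a frequent but useless pair
]

def passes_filter(ngram):
    """Keep an n-gram only if its tag sequence matches an allowed pattern."""
    return tuple(tag for _, tag in ngram) in PATTERNS

kept = [ng for ng in candidates if passes_filter(ng)]
```

The high-frequency "of the" pair is rejected because its tag sequence matches no allowed pattern.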
Collocations Frequency
Adding Filters
applied to the same New York Times text, we get

  C(w1 w2)   w1       w2          Tag pattern
  11487      New      York        A N
  7261       United   States      A N
  5412       Los      Angeles     N N
  3301       last     year        A N
  ...        ...      ...         ...
  1074       chief    executive   A N
  1073       real     estate      A N
  ...        ...      ...         ...

similarly, given particular adjectives, we can find the most frequent co-occurring nouns:

  w         C(strong, w)     w           C(powerful, w)
  support   50               force       13
  safety    22               computers   10
  sales     21               position    8
  ...       ...              ...         ...
  man       9                man         8
  ...       ...              ...         ...

good, but still not perfect
  e.g. man occurring in both lists
  we want to ignore it if man is relatively common by itself
Collocations Hypothesis Testing (+ background: Basic Probability Theory)
Random Variable
the probability of an event is the likelihood that it will occur,
represented by a number between 0 and 1:
probability 0: impossibility
probability 1: certainty
probability 0.5: equally likely to occur as not
a random variable ranges over all the possible types of outcomes for the event being measured ... example:
  random variable X = the result of rolling a die
  P(X = 1) = the probability of the die showing 1 = 1/6
  P(X = 1) = P(X = 2) = ... = P(X = 6) = 1/6
properties:
  the probability of an outcome is always between 0 and 1
  the sum of the probabilities of all outcomes is 1
Collocations Hypothesis Testing (+ background: Basic Probability Theory)
Summary Measures for Random Variables
the (arithmetic) mean, or average:

  E(X) = Σᵢ xᵢ P(xᵢ)

from the die example, E(X) = 1·(1/6) + 2·(1/6) + ... + 6·(1/6) = 3.5

the variance, to measure the spread:

  Var(X) = E((X − E(X))²) = E(X²) − (E(X))² = Σᵢ xᵢ² P(xᵢ) − (E(X))²

from the die example, Var(X) = 1·(1/6) + 4·(1/6) + ... + 36·(1/6) − 3.5² ≈ 2.92

note that these apply to the whole population
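The die computation above can be checked exactly with rational arithmetic (a sketch, not from the slides):

```python
from fractions import Fraction

# fair die: E(X) = sum_i x_i P(x_i), Var(X) = sum_i x_i^2 P(x_i) - (E(X))^2
p = Fraction(1, 6)
mean = sum(x * p for x in range(1, 7))                  # 7/2 = 3.5
var = sum(x * x * p for x in range(1, 7)) - mean ** 2   # 35/12, about 2.92
```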
Collocations Hypothesis Testing (+ background: Basic Probability Theory)
Example
imagine a six-sided die where each outcome wasn't equally likely, but had
  P(X=1) = P(X=6) = 1/100, P(X=2) = P(X=5) = 4/100, P(X=3) = P(X=4) = 45/100
E(X) = 1·(1/100) + 2·(4/100) + ... + 6·(1/100) = 3.5, as before
Var(X) = 1·(1/100) + 4·(4/100) + ... + 36·(1/100) − 3.5² = 0.53
Collocations Hypothesis Testing (+ background: Basic Probability Theory)
Estimating Probabilities
Maximum Likelihood Estimator (MLE)
  used to estimate the theoretical probability from a sample
  if a specific event has occurred m times out of n occasions, the MLE probability is m/n
  the larger the number of occasions measured, the more accurate the MLE
sample mean and variance, given observations xᵢ:
  sample mean

    x̄ = (1/n) Σᵢ₌₁ⁿ xᵢ

  sample variance

    s² = Σᵢ₌₁ⁿ (xᵢ − x̄)² / (n − 1)
Collocations Hypothesis Testing (+ background: Basic Probability Theory)
Estimating Probabilities
imagine we don't know the population probabilities
we want to estimate them from a sample

  die outcome   number of times rolled
  1             16
  2             18
  3             13
  4             16
  5             19
  6             18
  total         100

  x̄ = (16·1 + ... + 18·6) / 100 = 3.58

  s² = (16·(1 − 3.58)² + ... + 18·(6 − 3.58)²) / 99 ≈ 3.05
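A minimal check of the slide's arithmetic, using the observed roll counts:

```python
# observed rolls from the slide: outcome -> number of times rolled
counts = {1: 16, 2: 18, 3: 13, 4: 16, 5: 19, 6: 18}
n = sum(counts.values())                                   # 100 rolls

xbar = sum(x * c for x, c in counts.items()) / n           # sample mean
s2 = sum(c * (x - xbar) ** 2 for x, c in counts.items()) / (n - 1)  # sample variance
```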
Collocations Hypothesis Testing (+ background: Basic Probability Theory)
Estimating Probabilities
problem: sparse data
  it's more difficult to estimate the probability of a rare event
  if the corpus doesn't register, say, a rare word, the MLE for the word is 0
possible solution: smoothing
  even though the MLE seems like the right way of estimating probabilities, it's not always the right one
  infrequent events can get too little probability mass
  we can redistribute some of the probability mass
Collocations Hypothesis Testing (+ background: Basic Probability Theory)
Hypothesis Testing
using frequencies, as previously, we might decide that new companies is a collocation because it has a high frequency
however, we don't really think it is one; maybe it's just that new and companies are individually frequent, and appear together by chance
hypothesis testing is a way of assessing whether something is due to chance
it has the following procedure:
  formulate a NULL HYPOTHESIS H0, that there is no association beyond chance
  calculate the probability p that the event would occur if H0 were true
  then, if p is too low (usually 0.05 or smaller), reject H0; retain it as a possibility otherwise
Collocations The t-Test
The t-Test
we want a test that will say how likely or unlikely a certain event is to occur
the t-test compares a sample mean with a population mean, relative to the sample's variability:

  t = (x̄ − μ) / √(s²/N)

where N is the size of the sample, and μ is the population mean according to the null hypothesis
look up this t-value in a table
  the table gives the t-value for a given confidence level and a given number of degrees of freedom (d.f. = N − 1)

  p        0.05    0.01    0.005   0.001
  d.f. 1   6.314   31.82   63.66   318.3
  10       1.812   2.764   3.169   4.144
  20       1.725   2.528   2.845   3.552
  ∞        1.645   2.326   2.576   3.091
Collocations The t-Test
The t-Test
non-linguistic example:
  H0: the mean height of a population of men is 158cm (vs a population of shorter men)
  sample data: size 200, with x̄ = 169 and s² = 2600

  t = (169 − 158) / √(2600/200) ≈ 3.05

looking this up in the table, t > 2.576, so we can reject H0 with 99.5% confidence
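The slide's t statistic, computed directly (the critical value 2.576 is the large-d.f. entry for p = 0.005 from the table):

```python
import math

# one-sample t statistic: t = (xbar - mu0) / sqrt(s2 / N)
xbar, mu0, s2, N = 169.0, 158.0, 2600.0, 200

t = (xbar - mu0) / math.sqrt(s2 / N)
reject_at_995 = t > 2.576   # reject H0 with 99.5% confidence
```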
Collocations Pearson's Chi-Square Test (+ background: Distributions)
Distributions
a PROBABILITY DISTRIBUTION FUNCTION is a function describing
the mapping from random variable values to probabilities
these can be either discrete (from a finite set) or continuous
we've already seen a UNIFORM DISTRIBUTION (the original die example)
this was a discrete function:

  P(X = x) = 1/n  (where n is the number of outcomes)
Collocations Pearson's Chi-Square Test (+ background: Distributions)
Gaussian Distribution
another important (continuous) one is the GAUSSIAN (OR NORMAL) DISTRIBUTION
defined by the function

  f(x) = (1 / (σ√(2π))) · exp(−(x − μ)² / (2σ²))

population mean is μ, variance σ²
a lot of data can be assumed to have this distribution, e.g. heights in a population
the t-test described previously assumes a normal distribution
Collocations Pearsons Chi-Square Test (+ background: Distributions)
Bernoulli Distribution
the discrete BERNOULLI DISTRIBUTION measures the probability of success in a yes/no experiment, with this probability called p
defined by P(X = 1) = p, P(X = 0) = 1 − p
population mean is p, variance is p(1 − p)
Collocations Pearsons Chi-Square Test (+ background: Distributions)
Binomial Distribution
the discrete BINOMIAL DISTRIBUTION measures the probability of the number of successes in a sequence of n independent yes/no experiments (Bernoulli trials), each of which has probability p
defined by

  P(X = k) = (n choose k) p^k (1 − p)^(n−k)

population mean is np, variance is np(1 − p)
models things like the probability of getting k heads from n tosses of a fair coin
for large n, it can be approximated by the normal distribution
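A sketch of the binomial pmf and its normal approximation for large n (the coin-toss numbers are illustrative, not from the slides):

```python
import math

def binom_pmf(k, n, p):
    """P(X = k) = (n choose k) p^k (1 - p)^(n - k)"""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

# probability of exactly 5 heads in 10 tosses of a fair coin: 252/1024
p_five = binom_pmf(5, 10, 0.5)

# for large n the binomial is close to a Gaussian with mu = np, var = np(1 - p)
n, p = 1000, 0.5
mu, var = n * p, n * p * (1 - p)
normal_approx = math.exp(-((500 - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)
```

For n = 1000 the Gaussian density at the mean agrees with the exact binomial probability to well under a percent.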
Pearson's Chi-Square Test
as for the t-test, the χ² test has an associated number of degrees of freedom
  for a table of dimensions r × c, there are (r − 1)(c − 1) degrees of freedom
we check the distribution for χ²:

  p        0.99      0.95     0.10    0.05    0.01    0.005   0.001
  d.f. 1   0.00016   0.0039   2.71    3.84    6.63    7.88    10.83
  2        0.020     0.10     4.60    5.99    9.21    10.60   13.82
  3        0.115     0.35     6.25    7.81    11.34   12.84   16.27
  4        0.297     0.71     7.78    9.49    13.28   14.86   18.47
  100      70.06     77.93    118.5   124.3   135.8   140.2   149.4

the X² value is less than the critical value for α = 0.05, so we wouldn't reject H0: i.e. we wouldn't take new companies as a collocation, as before with the t-test
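As a sketch of the computation behind such a test, the 2×2 X² statistic can be computed with the standard shortcut formula; the counts below are hypothetical, not the new companies data:

```python
def chi_square_2x2(o11, o12, o21, o22):
    """Pearson's X^2 for a 2x2 contingency table (shortcut form)."""
    n = o11 + o12 + o21 + o22
    num = n * (o11 * o22 - o12 * o21) ** 2
    den = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    return num / den

# a perfectly independent (hypothetical) table gives X^2 = 0 ...
x2_indep = chi_square_2x2(10, 40, 30, 120)
# ... while an associated one exceeds the 3.84 critical value (d.f. 1, p = 0.05)
x2_assoc = chi_square_2x2(30, 20, 10, 140)
```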
Comparison: Chi-Square vs t-Test
for the previous example, there's quite a lot of overlap
  for example, the top 20 bigrams according to the t-test are the same as the top 20 for χ²
however, χ² is also appropriate for large probabilities, where the normality assumption of the t-test fails
Collocations Likelihood Ratio Test (+ background: Conditional Probability)
Conditional Probability
we've already in fact used the notion of independent events
  two events are independent of each other if the occurrence of one does not affect the probability of the occurrence of the other
  tossing a coin and winning the lottery: independent
  speeding and having an accident: not independent
conditional probability: the probability that one event occurs given that another event occurs
Collocations Likelihood Ratio Test (+ background: Conditional Probability)
Example
the following table shows the weather conditions for 100 horse
races and how many times Harry won:
rain shine
win 15 5
no win 15 65
Harry won 20 out of 100 races: P(win) = 0.2 (by MLE)
the conditional probability of Harry winning given rain is
  P(win | rain) = 15/30 = 0.5
compare this with the χ² test: under the null hypothesis, the observed data was compared against the situation where the words were independent
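A minimal check of the slide's numbers, computing the marginal and conditional probabilities from the table:

```python
# joint counts from the slide's table: 100 races, outcome x weather
counts = {("win", "rain"): 15, ("win", "shine"): 5,
          ("no win", "rain"): 15, ("no win", "shine"): 65}
n = sum(counts.values())

p_win = (counts[("win", "rain")] + counts[("win", "shine")]) / n   # 20/100
p_win_given_rain = counts[("win", "rain")] / (
    counts[("win", "rain")] + counts[("no win", "rain")])          # 15/30
```

Since P(win | rain) = 0.5 differs from P(win) = 0.2, winning and rain are not independent.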
Collocations Likelihood Ratio Test (+ background: Conditional Probability)
Likelihood Ratio
another approach to hypothesis testing
more appropriate for sparse data than χ²
also more interpretable: it says how much more likely one hypothesis is than another
here, we explicitly examine two hypotheses to explain the bigram w1 w2:
  Hypothesis 1: P(w2 | w1) = p = P(w2 | ¬w1)
  Hypothesis 2: P(w2 | w1) = p1 ≠ p2 = P(w2 | ¬w1)
Hypothesis 1 represents independence of w1 and w2; Hypothesis 2 represents dependence (and hence a possible collocation)
Likelihood Ratio
we'll use the usual MLEs for p, p1, p2, writing c1, c2, c12 for the number of occurrences of w1, w2, w1 w2:

  p = c2/N,   p1 = c12/c1,   p2 = (c2 − c12)/(N − c1)

we'll also use the notation for a binomial distribution:

  b(k; n, p) = (n choose k) p^k (1 − p)^(n−k)

now, the likelihoods are
for Hypothesis 1:

  L(H1) = b(c12; c1, p) · b(c2 − c12; N − c1, p)

for Hypothesis 2:

  L(H2) = b(c12; c1, p1) · b(c2 − c12; N − c1, p2)
Collocations Likelihood Ratio Test (+ background: Conditional Probability)
Likelihood Ratio
the log of the likelihood ratio is then

  log λ = log (L(H1) / L(H2))

for bigrams of powerful:

  −2 log λ   C(w1)   C(w2)   C(w1 w2)   w1 w2
  1291.42    12593   932     150        most powerful
  99.31      379     932     10         politically powerful
  82.96      932     934     10         powerful computers
  80.39      932     3424    13         powerful force
  57.27      932     291     6          powerful symbol
  ...        ...     ...     ...        ...

the quantity −2 log λ has a χ² distribution
so you can do hypothesis testing using the χ² table
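A sketch of the computation, working in log space via lgamma to avoid underflow on corpus-sized counts (the token total below is a hypothetical round number, not the slide's corpus size):

```python
import math

def log_binom(k, n, p):
    """log b(k; n, p), computed via lgamma for numerical stability."""
    if p == 0.0:
        return 0.0 if k == 0 else float("-inf")
    if p == 1.0:
        return 0.0 if k == n else float("-inf")
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + k * math.log(p) + (n - k) * math.log(1 - p)

def neg2_log_lambda(c1, c2, c12, n):
    """-2 log lambda for bigram w1 w2, following the slide's L(H1), L(H2)."""
    p = c2 / n
    p1 = c12 / c1
    p2 = (c2 - c12) / (n - c1)
    ll_h1 = log_binom(c12, c1, p) + log_binom(c2 - c12, n - c1, p)
    ll_h2 = log_binom(c12, c1, p1) + log_binom(c2 - c12, n - c1, p2)
    return -2.0 * (ll_h1 - ll_h2)

# counts of roughly the slide's scale: w1 932x, w2 934x, together 10x, in 14M tokens
score = neg2_log_lambda(932, 934, 10, 14_000_000)
```

The score far exceeds the 10.83 critical value (d.f. 1, p = 0.001), while counts that exactly match the independence prediction give a score of 0.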
Collocations Likelihood Ratio Test (+ background: Conditional Probability)
Comparison: Chi-Square and Likelihood Ratio
the likelihood ratio has an intuitive meaning
  from the previous table, the bigram powerful symbol is e^(0.5 × 57.27) ≈ 2.729 × 10^12 times more likely to occur than would be expected from the individual words alone
  comparison carried out by Dunning [1993]
χ² tends to be less accurate with sparse data
  as a rule of thumb, it needs a large sample, and counts in each cell (i.e. occurrences of words or bigrams) of at least 5
  the events we're interested in in text (individual words or n-grams) are in fact often less frequent than this: related to the Zipfian distribution of words
  as an example, Dunning selected words from a 500,000-word corpus with frequencies of between 1 and 4; these included words like abandonment, clause, meat, poi and understatement
the log likelihood ratio is more accurate here (but still needs counts of at least 1)
Collocations Fisher's Exact Test
Fisher's Exact Test
the previous tests have all been PARAMETRIC tests: that is, they assume some distribution
it's possible to use a NON-PARAMETRIC test, which makes no such assumptions
the trade-off is that it's typically more time-consuming to calculate, and is only feasible for smaller amounts of data
Fisher's Exact Test computes the significance of an observed table by exhaustively computing the probability of every table that would have the same marginal totals
suggested as an alternative to the previous tests by Pedersen [1996]
Fisher's Exact Test
consider again a 2 × 2 contingency table:

                     w1 = new             w1 ≠ new
  w2 = companies     E1,1                 E1,2
                     (new companies)      (e.g. old companies)
  w2 ≠ companies     E2,1                 E2,2
                     (e.g. new machines)  (e.g. old machines)

the probability of obtaining any such set of values is

  p = ((E1,1 + E1,2) choose E1,1) · ((E2,1 + E2,2) choose E2,1)
      / ((E1,1 + E1,2 + E2,1 + E2,2) choose (E1,1 + E2,1))
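A sketch of the test: the hypergeometric probability of one table, and the one-sided significance obtained by summing over all tables at least as extreme with the same marginals (the 3/1/1/3 counts are hypothetical):

```python
from math import comb

def table_prob(e11, e12, e21, e22):
    """Hypergeometric probability of one 2x2 table with fixed marginals."""
    n = e11 + e12 + e21 + e22
    return comb(e11 + e12, e11) * comb(e21 + e22, e21) / comb(n, e11 + e21)

def fisher_right_tail(e11, e12, e21, e22):
    """P of seeing e11 or more co-occurrences, holding the marginals fixed."""
    total = 0.0
    k_max = min(e11 + e12, e11 + e21)
    for k in range(e11, k_max + 1):
        d = k - e11   # shift d cases from the off-diagonal onto the diagonal
        total += table_prob(e11 + d, e12 - d, e21 - d, e22 + d)
    return total

p_observed = table_prob(3, 1, 1, 3)       # 16/70
p_tail = fisher_right_tail(3, 1, 1, 3)    # 16/70 + 1/70 = 17/70
```

The exhaustive enumeration over tables is what makes the test exact, and also what limits it to small counts.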
Verb Subcategorisation
Outline
1 Collocations
Frequency
Hypothesis Testing (+ background: Basic Probability Theory)
The t-Test
Pearson's Chi-Square Test (+ background: Distributions)
Likelihood Ratio Test (+ background: Conditional Probability)
Fisher's Exact Test
2 Verb Subcategorisation
Precision and Recall
3 Semantic Similarity
Latent Semantic Indexing
4 Register Analysis
5 References
Verb Subcategorisation
Verb Subcategorisation
verbs express their semantic arguments with different syntactic
means
  the class of verbs with semantic arguments theme and recipient has a subcategory expressing these via a direct object and a prepositional phrase: he donated a large sum of money to the church
  a second subcategory permits double objects: he gave the church a large sum of money
these subcategorisation frames are typically not in dictionaries
we might be interested in identifying them via statistics
Brent [1993] developed the system Lerner to assign one of six frames to verbs:

  Description       Good Example           Bad Example
  NP only           greet them             *arrive them
  tensed clause     hope he'll attend      *want he'll attend
  infinitive        hope to attend         *greet to attend
  NP & clause       tell him he's a fool   *yell him he's a fool
  NP & infinitive   want him to attend     *hope him to attend
  NP & NP           tell him the story     *shout him the story
Verb Subcategorisation
Algorithm for Learning Subcat Frames
Lerner had two steps:
  1. Define cues. Define a regular pattern of words and syntactic categories which indicates the presence of the frame with high certainty. For a particular cue c_j we define the probability of error ε_j, which indicates how likely we are to make a mistake if we assign frame f to verb v based on cue c_j.
  2. Do hypothesis testing. Initially assume the frame is not appropriate for the verb: this is the null hypothesis H0. We reject H0 if the cue c_j indicates with high probability that H0 is wrong.
example: cue for frame NP only (transitive verb)
  (OBJ | SUBJ-OBJ | CAP) (PUNC | CC)
  OBJ = accusative case personal pronouns; SUBJ-OBJ = nominative or accusative case personal pronouns; CAP = capitalised word; PUNC = punctuation; CC = subordinating conjunction
  positive indicator for transitive verb: consider ... greet/V Peter/CAP ,/PUNC ...
Hypothesis Testing
suppose verb v_i occurs a total of n times in the corpus, and that there are m ≤ n occurrences with a cue for frame f_j
assume also some error rate ε_j in inferring frame f_j from cue c_j
this suggests a binomial distribution
then reject the null hypothesis H0 that v_i does not permit f_j with the following probability of error:

  pE = P(v_i does not permit f_j | C(v_i, c_j) ≥ m) = Σᵣ₌ₘⁿ (n choose r) ε_jʳ (1 − ε_j)ⁿ⁻ʳ

various values for ε_j were assessed
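The binomial tail above is a direct sum; a sketch with hypothetical counts (verb seen 100 times, cue fires 3 times, cue error rate 0.008):

```python
from math import comb

def p_error(n, m, eps):
    """P(>= m cue occurrences in n verb tokens | H0: frame not permitted),
    i.e. the binomial tail sum from the slide."""
    return sum(comb(n, r) * eps**r * (1 - eps) ** (n - r) for r in range(m, n + 1))

# hypothetical: verb occurs 100 times, cue fires 3 times, error rate eps = 0.008
pe = p_error(100, 3, 0.008)
reject_h0 = pe < 0.05   # if so, the frame is assigned to the verb
```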
Verb Subcategorisation Precision and Recall
Precision and Recall
typically, when building a statistical model to do something, you
want to evaluate how suitable it is
for example, how well it performs a simple task
the measures of PRECISION and RECALL are one way of doing that
imagine you have a system for sorting your objects of interest into two piles, relevant and irrelevant
  the system can make two types of error: classifying a relevant object as irrelevant, or an irrelevant one as relevant
  system decisions can then be broken into four categories: true positive (TP), false positive (FP), false negative (FN), true negative (TN)

                      actually:
                      relevant   irrelevant
  system predicts:
    relevant          TP         FP
    irrelevant        FN         TN
Verb Subcategorisation Precision and Recall
Precision and Recall
precision is the proportion of system-predicted relevant objects that are correct:

  PRE = TP / (TP + FP)

recall is the proportion of actually relevant objects that the system managed to predict as relevant:

  REC = TP / (TP + FN)

example: there's a set of 200 documents, of which 40 are actually relevant; your system says that 50 are relevant, including 20 of the ones that actually are

                      actually:
                      relevant   irrelevant
  system predicts:
    relevant          TP = 20    FP = 30
    irrelevant        FN = 20    TN = 130

then, PRE = 20/50 and REC = 20/40
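A minimal check of the example's numbers:

```python
# counts from the slide's worked example (200 documents in total)
tp, fp, fn, tn = 20, 30, 20, 130

precision = tp / (tp + fp)   # 20/50
recall = tp / (tp + fn)      # 20/40
```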
Verb Subcategorisation Precision and Recall
Verb Subcategorisation: Lerner Accuracy
for Lerner, precision and recall values were calculated for various ε_j
this is the table for the tensed clause frame:

  ε_j     TP   FP   TN   FN   MC   %MC   PRE    REC
  .0312   13   0    30   20   20   32    1.00   .39
  .0156   19   0    30   14   14   22    1.00   .58
  .0078   22   1    29   11   12   19    .96    .67
  .0039   25   1    29   8    9    14    .96    .76
  .0020   27   3    27   6    9    14    .90    .82
  .0010   29   5    25   4    9    14    .85    .88
  .0005   31   8    22   2    10   16    .79    .94
  .0002   31   13   17   2    15   24    .70    .94
  .0001   33   19   11   0    19   30    .63    1.00

MC is the total misclassified
F-Measure
there's typically a trade-off between precision and recall
there are a number of ways of combining them into a single
measure
one is the F-measure, the weighted harmonic mean of the two:

  F = (2 · PRE · REC) / (PRE + REC)
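Applied to the precision/recall numbers from the earlier example:

```python
def f_measure(pre, rec):
    """Harmonic mean of precision and recall (equal weighting)."""
    return 2 * pre * rec / (pre + rec)

# PRE = 0.4, REC = 0.5 from the earlier worked example
f = f_measure(0.4, 0.5)   # 4/9
```

The harmonic mean always lies between the two values, pulled towards the smaller one.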
Semantic Similarity
Outline
1 Collocations
Frequency
Hypothesis Testing (+ background: Basic Probability Theory)
The t-Test
Pearson's Chi-Square Test (+ background: Distributions)
Likelihood Ratio Test (+ background: Conditional Probability)
Fisher's Exact Test
2 Verb Subcategorisation
Precision and Recall
3 Semantic Similarity
Latent Semantic Indexing
4 Register Analysis
5 References
Semantic Similarity
Semantic Similarity
there are a number of resources that group words together by
semantic relatedness
  examples are thesauruses, WordNet
  semantic relations are synonymy, hypernymy, etc.
  e.g. dog and canine might be in a class together; this might be a hyponym of a class corresponding to animal
you might want to automatically derive classes to capture relations
  for when you have a new unknown word: e.g. in Susan had never eaten a fresh durian before, you don't know what kind of thing durian is
  or if you want types of classes other than the standard ones
Semantic Similarity Latent Semantic Indexing
Latent Semantic Indexing (LSI)
in LSI, we look at the interaction of terms and documents
the purpose of this interaction is twofold:
  to have the documents tell us which terms should be grouped together
  to have the grouped-together terms tell us about the similarity of the documents
this interaction is described by a matrix, and the grouping is carried out by a process called Singular Value Decomposition (SVD)
Example
say we have 5 terms of interest (cosmonaut, astronaut, moon, car, truck) and 6 documents
we describe their interaction by a matrix A, where cell aij contains the count of term i in document j

  A =
              d1   d2   d3   d4   d5   d6
  cosmonaut    1    0    1    0    0    0
  astronaut    0    1    0    0    0    0
  moon         1    1    0    0    0    0
  car          1    0    0    1    1    0
  truck        0    0    0    1    0    1

this can be thought of as a five-dimensional space (defined by the terms) with six objects in that space (the documents)
what we want to do is reduce the dimensions, thus grouping similar terms
Semantic Similarity Latent Semantic Indexing
Dimensionality Reduction
there are many possible types of dimensionality reduction
LSI chooses the mapping such that the reduced dimensions correspond to the greatest axes of variation
  that is, if the new dimensions are numbered 1 ... k, dimension 1 captures the greatest amount of commonality, dimension 2 the second greatest, and so on
this process is carried out by the matrix operation called Singular Value Decomposition
here, the term-by-document matrix A(t×d) is decomposed into three other matrices:

  A(t×d) = T(t×n) S(n×n) (D(d×n))^T

this decomposition is (almost) unique
Semantic Similarity Latent Semantic Indexing
Example
  T =
              Dim 1   Dim 2   Dim 3   Dim 4   Dim 5
  cosmonaut   −0.44   −0.30    0.57    0.58    0.25
  astronaut   −0.13   −0.33   −0.59    0.00    0.73
  moon        −0.48   −0.51   −0.37    0.00   −0.61
  car         −0.70    0.35    0.15   −0.58    0.16
  truck       −0.26    0.65   −0.41    0.58   −0.09

consider the columns ...
Semantic Similarity Latent Semantic Indexing
Example
  S =
   2.16   0.00   0.00   0.00   0.00
   0.00   1.59   0.00   0.00   0.00
   0.00   0.00   1.28   0.00   0.00
   0.00   0.00   0.00   1.00   0.00
   0.00   0.00   0.00   0.00   0.39

this matrix embodies the weight of the dimensions
the values always go from largest to smallest
Example
  D^T =
           d1      d2      d3      d4      d5      d6
  Dim 1   −0.75   −0.28   −0.20   −0.45   −0.33   −0.12
  Dim 2   −0.29   −0.53   −0.19    0.63    0.22    0.41
  Dim 3    0.28   −0.75    0.45    0.20    0.12   −0.33
  Dim 4    0.00    0.00    0.58    0.00   −0.58    0.58
  Dim 5   −0.53    0.29    0.63    0.19    0.41   −0.22
Semantic Similarity Latent Semantic Indexing
Example
so far we've just transformed the dimensions; now to reduce them
for this example, we decide to reduce to 2 dimensions
to look at the documents, we combine this reduced dimensionality with the weighting of the dimensions
derive the new matrix B(2×d) = S(2×2) D^T(2×d):

  B =
           d1      d2      d3      d4      d5      d6
  Dim 1   −1.62   −0.60   −0.44   −0.97   −0.70   −0.26
  Dim 2   −0.46   −0.84   −0.30    1.00    0.35    0.65
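The whole pipeline can be reproduced with numpy as a sketch (note that SVD columns are only determined up to sign, so individual entries may come out with flipped signs relative to the slides):

```python
import numpy as np

# term-document matrix A from the slides (rows: cosmonaut, astronaut, moon, car, truck)
A = np.array([
    [1, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 0, 0, 1, 1, 0],
    [0, 0, 0, 1, 0, 1],
], dtype=float)

# A = T S D^T via Singular Value Decomposition
T, S, Dt = np.linalg.svd(A, full_matrices=False)

# keep the two largest singular values: documents in the reduced 2-d space
B = np.diag(S[:2]) @ Dt[:2, :]
```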
Semantic Similarity Latent Semantic Indexing
Conceptually . . .
we can imagine that the terms are made up of semantic particles
  perhaps along the lines of Wierzbicka's semantic primitives
  however, these are not defined a priori; they are only a consequence of the relations in the given set of documents
LSI rearranges things so that the terms with the greatest number of semantic particles in common are grouped together
Register Analysis
Outline
1 Collocations
Frequency
Hypothesis Testing (+ background: Basic Probability Theory)
The t-Test
Pearson's Chi-Square Test (+ background: Distributions)
Likelihood Ratio Test (+ background: Conditional Probability)
Fisher's Exact Test
2 Verb Subcategorisation
Precision and Recall
3 Semantic Similarity
Latent Semantic Indexing
4 Register Analysis
5 References
Register Analysis
work done by Douglas Biber
these notes from Biber [1993]
idea: different registers have systematic patterns of variation
  e.g. professional letters vs academic prose
  we can do descriptive analyses, based on frequencies or proportions of selected characteristics
  however, we may also want to identify groups of characteristics distinguishing registers
Register Analysis
Descriptive Analysis
example: from the Brown corpus, mean frequencies of three dependent clause types (per 1000 words):

  register             relative   causative adverbial    that complement
                       clauses    subordinate clauses    clauses
  press reports        4.6        0.5                    3.4
  official documents   8.6        0.1                    1.6
  conversations        2.9        3.5                    4.1
  prepared speeches    7.9        1.6                    7.6

from this, we can see e.g. that relative clauses are common in official documents and prepared speeches relative to conversation
we may be interested in grouping many of these characteristics of text together
Register Analysis
Dimension Identification
Biber carried out a quantitative analysis of 67 linguistic features in
the LOB and London-Lund corpora
  features included: tense and aspect markers, place and time adverbials, pronouns and pro-verbs, nominal forms, prepositional phrases, adjectives, lexical specificity, lexical classes (e.g. hedges, emphatics), modals, specialised verb classes, reduced forms and discontinuous structures, passives, stative forms, dependent clauses, coordination, and questions
  frequencies were counted, and normalised to per-1000-word values
then, FACTOR ANALYSIS was carried out
  this is a dimensionality reduction procedure very similar to LSI
  the dimensions similarly end up in decreasing order of explanatory power
Register Analysis
Dimension Identification
after inspecting the results of the factor analysis, Biber interpretively labelled the first five dimensions:
  1. Informational vs Involved Production
  2. Narrative vs Nonnarrative Concerns
  3. Elaborated vs Situation-Dependent Reference
  4. Overt Expression of Persuasion
  5. Abstract vs Nonabstract Style
example of features associated with Dimension 1:

  functions               linguistic features           characteristic registers
  Monologue               nouns, adjectives             informational exposition
  Careful Production      prepositional phrases         e.g. official documents
  Informational           long words                    academic prose
  Faceless

  Interactive             1st and 2nd person pronouns   conversations
  (Inter)personal Focus   questions, reductions         (personal letters)
  Involved                stative verbs, hedges         (public conversations)
  Personal Stance         emphatics
  On-Line Production      adverbial subordination
Outline
1 Collocations
Frequency
Hypothesis Testing (+ background: Basic Probability Theory)
The t-Test
Pearson's Chi-Square Test (+ background: Distributions)
Likelihood Ratio Test (+ background: Conditional Probability)
Fisher's Exact Test
2 Verb Subcategorisation
Precision and Recall
3 Semantic Similarity
Latent Semantic Indexing
4 Register Analysis
5 References
References
Douglas Biber. Using Register-Diversified Corpora for General Language Studies. Computational Linguistics, 19(2):219-241, 1993.

Michael Brent. From grammar to lexicon: Unsupervised learning of lexical syntax. Computational Linguistics, 19(2):243-262, 1993.

Ted Dunning. Accurate Methods for the Statistics of Surprise and Coincidence. Computational Linguistics, 19(1):61-74, 1993.

Brigitte Krenn and Christer Samuelsson. The Linguist's Guide to Statistics: DON'T PANIC. URL http://coli.uni-sb.de/~christer. Version of December 19, 1997.

Christopher Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge, MA, USA, 1999.

Ted Pedersen. Fishing for Exactness. In Proceedings of the South-Central SAS Users Group Conference, Austin, TX, USA, 1996.