
    Statistics for Linguists: A Tutorial

    Mark Dras

Centre for Language Technology, Macquarie University

    HCSNet Summerfest

    28 November 2006


    Tutorial Structure

primarily as some examples of tasks linguists might be interested in

within these, statistical ideas that are useful:

hypothesis testing
various statistical measures (χ², likelihood ratios, ...)
statistical distributions
some other useful ideas (e.g. Latent Semantic Analysis)

basic material taken from Manning and Schütze [1999]

    another useful overview: Krenn and Samuelsson [1997]


    Collocations

    Outline

    1 Collocations

    Frequency

    Hypothesis Testing (+ background: Basic Probability Theory)

The t-Test
Pearson's Chi-Square Test (+ background: Distributions)

Likelihood Ratio Test (+ background: Conditional Probability)

Fisher's Exact Test

    2 Verb Subcategorisation

    Precision and Recall

    3 Semantic Similarity

    Latent Semantic Indexing

    4 Register Analysis

    5 References


    Collocations

    Definitions

collocation: an expression consisting of two or more words that correspond to some conventional way of saying things

Firth (1957): Collocations of a given word are statements of the habitual or customary places of that word

    examples:

noun phrases: strong tea, weapons of mass destruction
phrasal verbs: to make up
other standard phrases: the rich and powerful

    has limited compositionality

more than an idiom; example: international best practice



    Collocations Frequency

    Frequency

most basic idea: start with a corpus, and count the relevant frequencies

if looking for two-word collocations, just count frequencies of pairs of adjacent words
obvious problem: get lots of useless high-frequency words
from the New York Times:

C(w1 w2)   w1     w2
80871      of     the
58841      in     the
26430      to     the
21842      on     the
...
12622      from   the
11428      New    York
10007      he     said
...

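A minimal sketch of this counting step in Python; the one-line corpus is an invented stand-in for the newswire text used above:

```python
# Count frequencies of adjacent word pairs (candidate two-word collocations).
from collections import Counter

corpus = "new companies in the city said the new companies were new".split()

bigram_counts = Counter(zip(corpus, corpus[1:]))

# The top of a real list is dominated by function-word pairs like "of the",
# which is exactly the problem noted above.
for (w1, w2), count in bigram_counts.most_common(5):
    print(count, w1, w2)
```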

    Collocations Frequency

    Frequency

basic idea of frequency still maybe OK with conditions:

1 use when looking to verify specific alternatives or patterns; or
2 add a filter


    Collocations Frequency

    Example #1a: Eggcorns

    described in Language Log

    http://itre.cis.upenn.edu/myl/languagelog/

idea: something like a mistaken but plausible reanalysis of a word or phrase

examples: to step foot in, baited breath, free reign, hone in, ripe with mistakes, for all intensive purposes, manner from Heaven, give up the goat

like a folk etymology, a malapropism, or a mondegreen, but not quite any of these

collection by Chris Waigl

http://www.eggcorns.lascribe.net

interested in seeing whether an eggcorn is gaining currency

compare by Google hits, e.g. inclement weather (173K whG) vs inclimate weather (11K whG) vs incliment weather (719 whG)


    Collocations Frequency

    Example #1b: Snowclones

also defined at Language Log

idea: adaptable cliché frames

examples: Have X, will travel; X is the new Y; X, we have a problem

    again, use Google hits



    Collocations Frequency

    Aside: Google Counts

there's been discussion about the reliability of Google-derived frequencies

see Jean Véronis' blog and Language Log

    example of problem

Google query: junco partner lyrics (9440 whG)
Google query: junco partner lyrics connick (279 whG)
Google query: junco partner lyrics -connick (930 whG)

frequency counts over 100K generally regarded as unreliable, but may also be the case for smaller counts

problems appear to be related to Google's indexing, and treatment of near-identical page matches


    Collocations Frequency

    Adding Filters

alternatively, if the problem is to find rather than to verify, can use filters based on part of speech

for instance, in the previous example of extracting collocations, can use patterns like the following:

Tag pattern   Example
A N           linear function
N N           regression coefficients
A A N         Gaussian random variable
N P N         degrees of freedom
...


    Collocations Frequency

    Adding Filters

applied to the same New York Times text, get:

C(w1 w2)   w1 w2             Tag pattern
11487      New York          A N
7261       United States     A N
5412       Los Angeles       N N
3301       last year         A N
...
1074       chief executive   A N
1073       real estate       A N
...

similarly, given particular adjectives, can find the most frequent co-occurring nouns:

w         C(strong, w)   w           C(powerful, w)
support   50             force       13
safety    22             computers   10
sales     21             position    8
...                      ...
man       9              man         8
...                      ...

good, but still not perfect

e.g. man occurring in both lists
want to ignore if man is relatively common by itself

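A sketch of the filtering step, assuming the text is already part-of-speech tagged; the sentence and tag names here are invented for illustration:

```python
# Keep only adjacent pairs whose tag sequence matches an allowed pattern.
from collections import Counter

tagged = [("the", "DET"), ("linear", "A"), ("function", "N"),
          ("uses", "V"), ("regression", "N"), ("coefficients", "N"),
          ("of", "P"), ("a", "DET"), ("linear", "A"), ("function", "N")]

PATTERNS = {("A", "N"), ("N", "N")}  # a subset of the tag patterns above

filtered = Counter(
    (w1, w2)
    for (w1, t1), (w2, t2) in zip(tagged, tagged[1:])
    if (t1, t2) in PATTERNS
)

for (w1, w2), count in filtered.most_common():
    print(count, w1, w2)
```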

    Collocations Hypothesis Testing (+ background: Basic Probability Theory)

    Random Variable

the probability of an event is the likelihood that it will occur, represented by a number between 0 and 1:

probability 0: impossibility
probability 1: certainty
probability 0.5: equally likely to occur as not

a random variable ranges over all the possible types of outcomes for the event being measured ... example:

random variable X = the result of rolling a die
P(X = 1) = the probability of the die showing 1 = 1/6
P(X = 1) = P(X = 2) = ... = P(X = 6) = 1/6

    properties

the probability of an outcome is always between 0 and 1
the sum of the probabilities of all outcomes is 1




    Collocations Hypothesis Testing (+ background: Basic Probability Theory)

    Summary Measures for Random Variables

the (arithmetic) mean, or average:

$E(X) = \sum_i x_i P(x_i)$

from the die example, $E(X) = 1 \cdot (1/6) + 2 \cdot (1/6) + \ldots + 6 \cdot (1/6) = 3.5$

the variance, to measure the spread:

$Var(X) = E\big((X - E(X))^2\big) = E(X^2) - (E(X))^2 = \sum_i x_i^2 P(x_i) - (E(X))^2$

from the die example, $Var(X) = 1 \cdot (1/6) + 4 \cdot (1/6) + \ldots + 36 \cdot (1/6) - 3.5^2 \approx 2.92$

    note that these apply to the whole population

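The die example, checked directly from these definitions in a few lines of Python:

```python
# E(X) and Var(X) for a fair six-sided die, straight from the definitions.
outcomes = {x: 1 / 6 for x in range(1, 7)}

E = sum(x * p for x, p in outcomes.items())
Var = sum(x**2 * p for x, p in outcomes.items()) - E**2

print(E, Var)  # 3.5 and ~2.92
```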

    Collocations Hypothesis Testing (+ background: Basic Probability Theory)

    Example

imagine a six-sided die where each outcome wasn't equally likely, but had
P(X = 1) = P(X = 6) = 1/100, P(X = 2) = P(X = 5) = 4/100, P(X = 3) = P(X = 4) = 45/100

$E(X) = 1 \cdot (1/100) + 2 \cdot (4/100) + \ldots + 6 \cdot (1/100) = 3.5$, as before
$Var(X) = 1 \cdot (1/100) + 4 \cdot (4/100) + \ldots + 36 \cdot (1/100) - 3.5^2 = 0.53$


    Collocations Hypothesis Testing (+ background: Basic Probability Theory)

    Estimating Probabilities

    Maximum Likelihood Estimator (MLE)

used to estimate the theoretical probability from a sample
if a specific event has occurred m times out of n occasions, the MLE probability is m/n
the larger the number of occasions measured, the more accurate the MLE

sample mean and variance, given observations $x_i$:

sample mean

$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$

sample variance

$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$


    Collocations Hypothesis Testing (+ background: Basic Probability Theory)

    Estimating Probabilities

imagine we don't know the population probabilities

we want to estimate them from a sample

die outcome   number of times rolled
1             16
2             18
3             13
4             16
5             19
6             18
total         100

$\bar{x} = \frac{16 \cdot 1 + \ldots + 18 \cdot 6}{100} = 3.58$

$s^2 = \frac{16 \cdot (1 - 3.58)^2 + \ldots + 18 \cdot (6 - 3.58)^2}{99} \approx 3.05$

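The same estimates in code, a small sketch of the sample formulas above:

```python
# Sample mean and (n-1)-denominator sample variance for the die-roll data.
counts = {1: 16, 2: 18, 3: 13, 4: 16, 5: 19, 6: 18}

n = sum(counts.values())  # 100 rolls
mean = sum(x * c for x, c in counts.items()) / n
var = sum(c * (x - mean) ** 2 for x, c in counts.items()) / (n - 1)

print(mean, var)  # 3.58 and ~3.05
```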



    Collocations Hypothesis Testing (+ background: Basic Probability Theory)

    Estimating Probabilities

problem: sparse data

it's more difficult to estimate the probability of a rare event
if the corpus doesn't register, say, a rare word, the MLE for the word is 0

possible solution: smoothing

even though the MLE seems like the right way of estimating probabilities, it's not always the right one
infrequent events can get too little probability mass
can redistribute some of the probabilities

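One common redistribution scheme is add-one (Laplace) smoothing; a minimal sketch with an invented three-word vocabulary:

```python
# Add-one smoothing: every vocabulary item gets a pseudo-count of 1,
# so unseen words no longer receive an MLE probability of 0.
counts = {"the": 5, "cat": 2, "durian": 0}
V = len(counts)              # vocabulary size
N = sum(counts.values())     # total observed tokens

def p_mle(w):
    return counts[w] / N

def p_add_one(w):
    return (counts[w] + 1) / (N + V)

print(p_mle("durian"))      # 0.0 -- the sparse-data problem
print(p_add_one("durian"))  # 0.1 -- some probability mass redistributed
```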

    Collocations Hypothesis Testing (+ background: Basic Probability Theory)

    Hypothesis Testing

using frequencies, as previously, we might decide that new companies is a collocation because it has a high frequency

however, we don't really think it is one; maybe it's because new and companies are individually frequent, and just appear together by chance

hypothesis testing is a way of assessing whether something is due to chance

has the following procedure:

formulate a NULL HYPOTHESIS H0, that there is no association beyond chance
calculate the probability p that the event would occur if H0 were true
then, if p is too low (usually 0.05 or smaller), reject H0; retain it as a possibility otherwise


    Collocations The t-Test

    The t-Test

want a test that will say how likely or unlikely a certain event is to occur

the t-test compares a sample mean with a population mean, relative to the sample's variability

$t = \frac{\bar{x} - \mu}{\sqrt{s^2 / N}}$

where N is the size of the sample, and $\mu$ is the population mean according to the null hypothesis

look up this t-value against a table

table gives the t-value for a given confidence level and a given number of degrees of freedom (= N − 1)

p        0.05    0.01    0.005   0.001
d.f. 1   6.314   31.82   63.66   318.3
10       1.812   2.764   3.169   4.144
20       1.725   2.528   2.845   3.552
∞        1.645   2.326   2.576   3.091


    Collocations The t-Test

    The t-Test

non-linguistic example:

H0: the mean height of a population of men is 158cm (vs a population of shorter men)

sample data: size 200 with $\bar{x} = 169$ and $s^2 = 2600$

$t = \frac{169 - 158}{\sqrt{2600 / 200}} \approx 3.05$

looking up the table, t > 2.576, so we can reject H0 with 99.5% confidence

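The same calculation as a quick sketch in code:

```python
# One-sample t statistic for the height example: H0 says mu = 158cm.
import math

mu, x_bar, s2, N = 158, 169, 2600, 200

t = (x_bar - mu) / math.sqrt(s2 / N)
print(t)  # ~3.05 > 2.576 (d.f. infinity), so reject H0 at the 99.5% level
```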



Collocations Pearson's Chi-Square Test (+ background: Distributions)

    Distributions

a PROBABILITY DISTRIBUTION FUNCTION is a function describing the mapping from random variable values to probabilities

    these can be either discrete (from a finite set) or continuous

we've already seen a UNIFORM DISTRIBUTION (the original die example)

this was a discrete function

$P(X = x) = \frac{1}{n}$ (where n is the number of outcomes)


Collocations Pearson's Chi-Square Test (+ background: Distributions)

    Gaussian Distribution

another important (continuous) one is the GAUSSIAN (OR NORMAL) DISTRIBUTION

defined by the function $f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$

population mean is $\mu$, variance $\sigma^2$

a lot of data can be assumed to have this distribution, e.g. heights in a population

the t-test described previously assumes a normal distribution


Collocations Pearson's Chi-Square Test (+ background: Distributions)

    Bernoulli Distribution

the discrete BERNOULLI DISTRIBUTION measures the probability of success in a yes/no experiment, with this probability called p

defined by $P(X = 1) = p$, $P(X = 0) = 1 - p$

population mean is $p$, variance is $p(1 - p)$


Collocations Pearson's Chi-Square Test (+ background: Distributions)

    Binomial Distribution

the discrete BINOMIAL DISTRIBUTION measures the probability of the number of successes in a sequence of n independent yes/no experiments (Bernoulli distributions), each of which has probability p

defined by $P(X = k) = \binom{n}{k} p^k (1 - p)^{n-k}$

population mean is $np$, variance is $np(1 - p)$

models things like the probability of getting k heads from n tosses of a fair coin
for large n, can be approximated by the normal distribution
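A small sketch of this pmf, straight from the definition (fair coin, p = 0.5):

```python
# Binomial pmf: probability of exactly k successes in n independent trials.
from math import comb

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p) ** (n - k)

print(binom_pmf(5, 10, 0.5))  # ~0.246: chance of exactly 5 heads in 10 tosses
print(sum(binom_pmf(k, 10, 0.5) for k in range(11)))  # probabilities sum to 1
```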


Collocations Pearson's Chi-Square Test (+ background: Distributions)

Pearson's Chi-Square Test

as for the t-test, the χ² test has an associated number of degrees of freedom

for a table of dimensions r × c, there are (r − 1)(c − 1) degrees of freedom

we check the distribution for χ²:

p        0.99      0.95     0.10    0.05    0.01    0.005   0.001
d.f. 1   0.00016   0.0039   2.71    3.84    6.63    7.88    10.83
2        0.20      0.10     4.60    5.99    9.21    10.60   13.82
3        0.115     0.35     6.25    7.81    11.34   12.84   16.27
4        0.297     0.71     7.78    9.49    13.28   14.86   18.47
100      70.06     77.93    118.5   124.3   135.8   140.2   149.4

the X² value is less than the critical value (3.84 at α = 0.05, 1 d.f.), so we wouldn't reject H0: i.e. we wouldn't take new companies as a collocation, as before with the t-test

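A sketch of the standard 2×2 version of the test; the counts are illustrative stand-ins in the spirit of the new companies example, not the exact published figures:

```python
# Pearson's chi-square for a 2x2 contingency table of bigram counts.
# Expected counts come from the marginals under the independence H0.
def chi_square_2x2(table):
    total = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    x2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / total
            x2 += (table[i][j] - expected) ** 2 / expected
    return x2

# rows: w1 = new / w1 != new; columns: w2 = companies / w2 != companies
observed = [[8, 15820],
            [4667, 14287181]]
print(chi_square_2x2(observed))  # ~1.55 < 3.84, so do not reject H0
```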

    Comparison: Chi-Square vs t-Test

for the previous example, there's quite a lot of overlap

for example, the top 20 bigrams according to the t-test are the same as the top 20 for χ²

however, χ² also appropriate for large probabilities, where the normality assumption of the t-test fails


    Collocations Likelihood Ratio Test (+ background: Conditional Probability)

    Conditional Probability

we've already in fact used the notion of independent events

two events are independent of each other if the occurrence of one does not affect the probability of the occurrence of the other

tossing a coin and winning the lottery: independent
speeding and having an accident: not independent

conditional probability: the probability that one event occurs given that another event occurs


    Collocations Likelihood Ratio Test (+ background: Conditional Probability)

    Example

the following table shows the weather conditions for 100 horse races and how many times Harry won:

         rain   shine
win      15     5
no win   15     65

Harry won 20 out of 100 races: P(win) = 0.2 (by MLE)

the conditional probability of Harry winning given rain is

$P(\text{win} \mid \text{rain}) = 15/30 = 0.5$

compare this with the χ² test: under the null hypothesis, the observed data was compared against the situation where the words were independent


    Collocations Likelihood Ratio Test (+ background: Conditional Probability)


    Likelihood Ratio

another approach to hypothesis testing

more appropriate to sparse data than χ²
more interpretable also: says how much more likely one hypothesis is than another

here, examine explicitly two hypotheses to explain bigram $w_1 w_2$

Hypothesis 1: $P(w_2 \mid w_1) = p = P(w_2 \mid \neg w_1)$
Hypothesis 2: $P(w_2 \mid w_1) = p_1 \neq p_2 = P(w_2 \mid \neg w_1)$

Hypothesis 1 represents independence of $w_1$ and $w_2$; Hypothesis 2 represents dependence (and hence a possible collocation)


Likelihood Ratio

we'll use the usual MLEs for $p$, $p_1$, $p_2$, writing $c_1$, $c_2$, $c_{12}$ for the number of occurrences of $w_1$, $w_2$, $w_1 w_2$:

$p = \frac{c_2}{N}, \quad p_1 = \frac{c_{12}}{c_1}, \quad p_2 = \frac{c_2 - c_{12}}{N - c_1}$

we'll also use the notation for a binomial distribution

$b(k; n, p) = \binom{n}{k} p^k (1 - p)^{n-k}$

now, the likelihoods are

for Hypothesis 1,

$L(H_1) = b(c_{12}; c_1, p) \, b(c_2 - c_{12}; N - c_1, p)$

for Hypothesis 2,

$L(H_2) = b(c_{12}; c_1, p_1) \, b(c_2 - c_{12}; N - c_1, p_2)$

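A sketch of this computation in log space for numerical stability; the counts are the powerful computers row of the table on the next slide, and the corpus size N is an assumed round figure for the NYT data used there:

```python
# -2 log(lambda) for a bigram, following the two-hypothesis setup above.
import math

def log_b(k, n, p):
    # log of the binomial likelihood b(k; n, p), with p clamped into (0, 1)
    p = min(max(p, 1e-12), 1 - 1e-12)
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p))

def minus_two_log_lambda(c1, c2, c12, N):
    p, p1, p2 = c2 / N, c12 / c1, (c2 - c12) / (N - c1)
    log_L_H1 = log_b(c12, c1, p) + log_b(c2 - c12, N - c1, p)
    log_L_H2 = log_b(c12, c1, p1) + log_b(c2 - c12, N - c1, p2)
    return -2 * (log_L_H1 - log_L_H2)

# powerful computers: c(powerful) = 932, c(computers) = 934, c(bigram) = 10
print(minus_two_log_lambda(932, 934, 10, 14_300_000))  # large => dependence
```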

    Collocations Likelihood Ratio Test (+ background: Conditional Probability)

    Likelihood Ratio

the log of the likelihood ratio is then

$\log \lambda = \log \frac{L(H_1)}{L(H_2)}$

for bigrams of powerful:

−2 log λ   C(w1)   C(w2)   C(w1 w2)   w1 w2
1291.42    12593   932     150        most powerful
99.31      379     932     10         politically powerful
82.96      932     934     10         powerful computers
80.39      932     3424    13         powerful force
57.27      932     291     6          powerful symbol
...

the value $-2 \log \lambda$ has a χ² distribution

so you can do hypothesis testing using the χ² table


    Collocations Likelihood Ratio Test (+ background: Conditional Probability)

    Comparison: Chi-Square and Likelihood Ratio

the likelihood ratio has an intuitive meaning

from the previous table, the bigram powerful symbol is $e^{0.5 \times 57.27} \approx 2.7 \times 10^{12}$ times more likely to occur than would be expected by the individual words alone

comparison carried out by Dunning [1993]

χ² tends to be less accurate with sparse data

as a rule of thumb, need a large sample, and counts in each cell (i.e. occurrences of words or bigrams) of at least 5

the events we're interested in in text (individual words or n-grams) are in fact often less frequent than this: related to the Zipfian distribution of words
as an example, Dunning selected words from a 500,000 word corpus with frequencies of between 1 and 4; these included words like abandonment, clause, meat, poi and understatement

the log likelihood is more accurate here (but still needs counts of at least 1)


Collocations Fisher's Exact Test


Fisher's Exact Test

the previous tests have all been PARAMETRIC tests: that is, they assume some distribution

it's possible to use a NON-PARAMETRIC test, which makes no assumptions

trade-off is that it's typically more time-consuming to calculate, and is only feasible for smaller amounts of data

Fisher's Exact Test computes the significance of an observed table by exhaustively computing the probability of every table that would have the same marginal totals

suggested as an alternative to the previous tests by Pedersen [1996]


Fisher's Exact Test

consider again a 2 × 2 contingency table

                  w1 = new                  w1 ≠ new
w2 = companies    E11 (new companies)       E12 (e.g. old companies)
w2 ≠ companies    E21 (e.g. new machines)   E22 (e.g. old machines)

the probability of obtaining any such set of values is

$p = \frac{\binom{E_{11} + E_{12}}{E_{11}} \binom{E_{21} + E_{22}}{E_{21}}}{\binom{E_{11} + E_{12} + E_{21} + E_{22}}{E_{11} + E_{21}}}$

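A sketch using SciPy's implementation, reusing the illustrative 2×2 counts from the χ² sketch earlier:

```python
# Fisher's exact test on the 2x2 table, laid out as in the slide above.
from scipy.stats import fisher_exact

table = [[8, 4667],          # w2 = companies:  E11, E12
         [15820, 14287181]]  # w2 != companies: E21, E22

odds_ratio, p_value = fisher_exact(table, alternative="greater")
print(p_value)  # one-sided p-value; compare against 0.05
```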

    Verb Subcategorisation

    Outline

    1 Collocations

    Frequency

    Hypothesis Testing (+ background: Basic Probability Theory)

The t-Test
Pearson's Chi-Square Test (+ background: Distributions)

Likelihood Ratio Test (+ background: Conditional Probability)

Fisher's Exact Test

    2 Verb Subcategorisation

    Precision and Recall

    3 Semantic Similarity

    Latent Semantic Indexing

    4 Register Analysis

    5 References


    Verb Subcategorisation

    Verb Subcategorisation

verbs express their semantic arguments with different syntactic means

the class of verbs with semantic arguments theme and recipient has a subcategory expressing these via a direct object and a prepositional phrase: he donated a large sum of money to the church
a second subcategory permits double objects: he gave the church a large sum of money

    these subcategorisation frames are typically not in dictionaries

    might be interested in identifying them via statistics

Brent [1993] developed the system Lerner to assign one of six frames to verbs

Description       Good Example            Bad Example
NP only           greet them              *arrive them
tensed clause     hope he'll attend       *want he'll attend
infinitive        hope to attend          *greet to attend
NP & clause       tell him he's a fool    *yell him he's a fool
NP & infinitive   want him to attend      *hope him to attend
NP & NP           tell him the story      *shout him the story


    Verb Subcategorisation


    Algorithm for Learning Subcat Frames

    Lerner had two steps:

1 Define cues. Define a regular pattern of words and syntactic categories which indicates the presence of the frame with high certainty (low probability of error). For a particular cue $c_j$ we define the probability of error $\varepsilon_j$ that indicates how likely we are to make a mistake if we assign frame $f$ to verb $v$ based on cue $c_j$.

2 Do hypothesis testing. Initially assume the frame is not appropriate for the verb: this is the null hypothesis H0. We reject H0 if the cue $c_j$ indicates with high probability that H0 is wrong.

example: cue for frame NP only (transitive verb)

(OBJ | SUBJ_OBJ | CAP) (PUNC | CC)
OBJ = accusative case personal pronouns; SUBJ_OBJ = nominative or accusative case personal pronouns; CAP = capitalised word; PUNC = punctuation; CC = subordinating conjunction
positive indicator for a transitive verb: consider ... greet/V Peter/CAP ,/PUNC ...


    Hypothesis Testing

    suppose verbvioccurs a total ofntimes in the corpus and that

    there arem noccurrences with a cue for frame fj

    assume also some errorjin inferring a frame fjfrom cue cj

    this suggests a binomial distribution

    then reject null hypothesisH0thatvidoes not permit fjwith the

    following probability of error:

    pE=P(vidoes not permit fj |C(vi, cj) m) =n

    r=m

    n

    r

    rj(1j)

    nr

    various values forjwere assessed

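A sketch of this binomial tail probability; the verb counts and cue error rate are invented for illustration:

```python
# p_E: upper-tail binomial probability that a verb which does NOT permit
# the frame would still show the cue at least m times out of n, by error.
from math import comb

def p_error(n, m, eps):
    return sum(comb(n, r) * eps**r * (1 - eps) ** (n - r)
               for r in range(m, n + 1))

# verb seen n=80 times, m=6 of them with the cue, cue error rate eps=0.0312
print(p_error(80, 6, 0.0312))  # if small (say < 0.02), reject H0
```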

    Verb Subcategorisation Precision and Recall

    Precision and Recall

typically, when building a statistical model to do something, you want to evaluate how suitable it is

for example, how well it performs a simple task

the measures of PRECISION and RECALL are one way of doing that

imagine you have a system for sorting your objects of interest into two piles, relevant and irrelevant

system can make two types of errors: classifying a relevant object as irrelevant, or an irrelevant one as relevant
system decisions can then be broken into four categories: true positive (TP), false positive (FP), false negative (FN), true negative (TN)

                        actually:
                   relevant   irrelevant
system predicts:
relevant           TP         FP
irrelevant         FN         TN


    Verb Subcategorisation Precision and Recall

    Precision and Recall

precision is the proportion of system-predicted relevant objects that are correct:

$PRE = \frac{TP}{TP + FP}$

recall is the proportion of actually relevant objects that the system managed to predict as relevant:

$REC = \frac{TP}{TP + FN}$

example: there's a set of 200 documents, of which 40 are actually relevant; your system says that 50 are relevant, including 20 of the ones that actually are

                        actually:
                   relevant   irrelevant
system predicts:
relevant           TP = 20    FP = 30
irrelevant         FN = 20    TN = 130

then, $PRE = \frac{20}{50}$ and $REC = \frac{20}{40}$

    Verb Subcategorisation Precision and Recall


    Verb Subcategorisation: Lerner Accuracy

for Lerner, precision and recall values were calculated for various $\varepsilon_j$

this is the table for the tensed clause frame

εj      TP   FP   TN   FN   MC   %MC   PRE    REC
.0312   13   0    30   20   20   32    1.00   .39
.0156   19   0    30   14   14   22    1.00   .58
.0078   22   1    29   11   12   19    .96    .67
.0039   25   1    29   8    9    14    .96    .76
.0020   27   3    27   6    9    14    .90    .82
.0010   29   5    25   4    9    14    .85    .88
.0005   31   8    22   2    10   16    .79    .94
.0002   31   13   17   2    15   24    .70    .94
.0001   33   19   11   0    19   30    .63    1.00

MC is total misclassified


    F-Measure

there's typically a trade-off between precision and recall

there are a number of ways of combining them into a single measure

one is the F-measure, the harmonic mean of the two:

$F = \frac{2 \cdot PRE \cdot REC}{PRE + REC}$

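The worked example above as a quick sketch:

```python
# Precision, recall and F-measure from the four decision counts.
TP, FP, FN, TN = 20, 30, 20, 130

pre = TP / (TP + FP)            # 20/50 = 0.4
rec = TP / (TP + FN)            # 20/40 = 0.5
f = 2 * pre * rec / (pre + rec)

print(pre, rec, f)              # 0.4 0.5 ~0.444
```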

    Semantic Similarity

    Outline

    1 Collocations

    Frequency

    Hypothesis Testing (+ background: Basic Probability Theory)

The t-Test
Pearson's Chi-Square Test (+ background: Distributions)

Likelihood Ratio Test (+ background: Conditional Probability)

Fisher's Exact Test

    2 Verb Subcategorisation

    Precision and Recall

    3 Semantic Similarity

    Latent Semantic Indexing

    4 Register Analysis

    5 References


    Semantic Similarity

    Semantic Similarity

there are a number of resources that group words together by semantic relatedness

examples are thesauruses, WordNet

semantic relations are synonymy, hypernymy, etc.
e.g. dog and canine might be in a class together; this might be a hyponym of a class corresponding to animal

you might want to automatically derive classes to capture relations

for when you have a new unknown word: e.g. if in Susan had never eaten a fresh durian before you don't know what kind of thing durian is
if you want types of classes other than the standard ones


    Semantic Similarity Latent Semantic Indexing


    Latent Semantic Indexing (LSI)

    in LSI, we look at the interaction of terms and documents

the purpose of this interaction is twofold:

to have the documents tell us which terms should be grouped together
to have the grouped-together terms tell us about the similarity of the documents

this interaction is described by a matrix, and the grouping is carried out by a process called Singular Value Decomposition (SVD)


    Example

say we have 5 terms of interest (cosmonaut, astronaut, moon, car, truck) and 6 documents

we describe their interaction by a matrix A, where cell $a_{ij}$ contains the count of term i in document j

A =
            d1   d2   d3   d4   d5   d6
cosmonaut   1    0    1    0    0    0
astronaut   0    1    0    0    0    0
moon        1    1    0    0    0    0
car         1    0    0    1    1    0
truck       0    0    0    1    0    1

this can be thought of as a five-dimensional space (defined by the terms) with six objects in that space (the documents)

what we want to do is reduce the dimensions, thus grouping similar terms


    Semantic Similarity Latent Semantic Indexing

    Dimensionality Reduction

there are many possible types of dimensionality reduction

LSI chooses the mapping such that the reduced dimensions correspond to the greatest axes of variation

that is, if the new dimensions are numbered 1 ... k, dimension 1 captures the greatest amount of commonality, dimension 2 the second greatest, and so on

this process is carried out by the matrix operation called Singular Value Decomposition

here, the term-by-document matrix $A_{t \times d}$ is decomposed into three other matrices:

$A_{t \times d} = T_{t \times n} S_{n \times n} (D_{d \times n})^T$

this decomposition is (almost) unique


    Semantic Similarity Latent Semantic Indexing

    Example

T =
            Dim 1   Dim 2   Dim 3   Dim 4   Dim 5
cosmonaut   0.44    0.30    0.57    0.58    0.25
astronaut   0.13    0.33    0.59    0.00    0.73
moon        0.48    0.51    0.37    0.00    0.61
car         0.70    0.35    0.15    0.58    0.16
truck       0.26    0.65    0.41    0.58    0.09

consider the columns ...


    Semantic Similarity Latent Semantic Indexing


    Example

S =
2.16   0.00   0.00   0.00   0.00
0.00   1.59   0.00   0.00   0.00
0.00   0.00   1.28   0.00   0.00
0.00   0.00   0.00   1.00   0.00
0.00   0.00   0.00   0.00   0.39

    this matrix embodies the weight of the dimensions

    it always goes largest to smallest


    Example

DT =
        d1     d2     d3     d4     d5     d6
Dim 1   0.75   0.28   0.20   0.45   0.33   0.12
Dim 2   0.29   0.53   0.19   0.63   0.22   0.41
Dim 3   0.28   0.75   0.45   0.20   0.12   0.33
Dim 4   0.00   0.00   0.58   0.00   0.58   0.58
Dim 5   0.53   0.29   0.63   0.19   0.41   0.22


    Semantic Similarity Latent Semantic Indexing

    Example

so far we've just transformed the dimensions; now to reduce

for this example, decide to reduce to 2 dimensions

to look at documents, combine this reduced dimensionality with the weighting of the dimensions

derive new matrix $B_{2 \times d} = S_{2 \times 2} D^T_{2 \times d}$

B =
        d1     d2     d3     d4     d5     d6
Dim 1   1.62   0.60   0.44   0.97   0.70   0.26
Dim 2   0.46   0.84   0.30   1.00   0.35   0.65

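A sketch of the whole pipeline with NumPy; signs and small numerical details may differ from the printed tables, since an SVD is only unique up to the sign of each dimension:

```python
# SVD of the term-document matrix A, then the rank-2 document coordinates
# B = S[:2,:2] @ D^T[:2,:] as derived above.
import numpy as np

A = np.array([[1, 0, 1, 0, 0, 0],   # cosmonaut
              [0, 1, 0, 0, 0, 0],   # astronaut
              [1, 1, 0, 0, 0, 0],   # moon
              [1, 0, 0, 1, 1, 0],   # car
              [0, 0, 0, 1, 0, 1]])  # truck

T, s, Dt = np.linalg.svd(A, full_matrices=False)
print(np.round(s, 2))           # diagonal of S: 2.16, 1.59, 1.28, 1.00, 0.39

B = np.diag(s[:2]) @ Dt[:2, :]  # 2-dimensional coordinates for the documents
print(np.round(B, 2))
```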

    Semantic Similarity Latent Semantic Indexing

    Conceptually . . .

can imagine that the terms are made up of semantic particles

perhaps along the lines of Wierzbicka's semantic primitives
however, not defined a priori; only a consequence of the relations in the given set of documents

LSI rearranges things so that the terms with the greatest number of semantic particles in common are grouped


    Register Analysis


    Outline

    1 Collocations

    Frequency

    Hypothesis Testing (+ background: Basic Probability Theory)

    The t-Test

Pearson's Chi-Square Test (+ background: Distributions)

    Likelihood Ratio Test (+ background: Conditional Probability)

Fisher's Exact Test

    2 Verb Subcategorisation

    Precision and Recall

    3 Semantic Similarity

    Latent Semantic Indexing

    4 Register Analysis

    5 References


    Register Analysis

work done by Douglas Biber

these notes from Biber [1993]

idea: different registers have systematic patterns of variation

e.g. professional letters vs academic prose
can do descriptive analyses, based on frequency or proportions of selected characteristics
however, may also want to identify groups of characteristics distinguishing registers


    Register Analysis

    Descriptive Analysis

example: from the Brown corpus, mean frequencies of three dependent clause types (per 1000 words)

register             relative   causative adverbial    that complement
                     clauses    subordinate clauses    clauses
press reports        4.6        0.5                    3.4
official documents   8.6        0.1                    1.6
conversations        2.9        3.5                    4.1
prepared speeches    7.9        1.6                    7.6

from this, can see e.g. that relative clauses are common in official documents and prepared speeches relative to conversation

may be interested in grouping many of these characteristics of text together


    Register Analysis

    Dimension Identification

Biber carried out a quantitative analysis of 67 linguistic features in the LOB and London-Lund corpora

features included: tense and aspect markers, place and time adverbials, pronouns and pro-verbs, nominal forms, prepositional phrases, adjectives, lexical specificity, lexical classes (e.g. hedges, emphatics), modals, specialised verb classes, reduced forms and discontinuous structures, passives, stative forms, dependent clauses, coordination, and questions
frequencies were counted, and normalised to per-1000 values

then, FACTOR ANALYSIS was carried out

this is a dimensionality reduction procedure very similar to LSI
the dimensions similarly end up in decreasing order of explanatory power

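A sketch of this kind of analysis with scikit-learn's FactorAnalysis; the texts, features and counts are entirely invented for illustration (Biber used 67 features over two corpora):

```python
# Factor analysis over per-1000-word feature frequencies: rows are texts,
# columns are linguistic features (e.g. pronouns, nouns, passives).
import numpy as np
from sklearn.decomposition import FactorAnalysis

X = np.array([[58.0, 180.0,  9.0],   # conversation-like text
              [12.0, 240.0, 21.0],   # academic-like text
              [61.0, 175.0,  8.0],
              [10.0, 251.0, 19.0]])

fa = FactorAnalysis(n_components=1, random_state=0)
scores = fa.fit_transform(X)   # each text's position on the first factor

print(fa.components_)          # how strongly each feature loads on the factor
print(scores)
```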

    Register Analysis

Dimension Identification

after inspecting the results of the factor analysis, Biber interpretively labelled the first five dimensions:

1 Informational vs Involved Production
2 Narrative vs Nonnarrative Concerns
3 Elaborated vs Situation-Dependent Reference
4 Overt Expression of Persuasion
5 Abstract vs Nonabstract Style

example of features associated with Dimension 1:

functions               linguistic features           characteristic registers
Monologue               nouns, adjectives             informational exposition
Careful Production      prepositional phrases         e.g. official documents
Informational           long words                    academic prose
Faceless

Interactive             1st and 2nd person pronouns   conversations
(Inter)personal Focus   questions, reductions         (personal letters)
Involved                stative verbs, hedges         (public conversations)
Personal Stance         emphatics
On-Line Production      adverbial subordination

References

Outline

    1 Collocations

    Frequency

    Hypothesis Testing (+ background: Basic Probability Theory)

    The t-Test

Pearson's Chi-Square Test (+ background: Distributions)

    Likelihood Ratio Test (+ background: Conditional Probability)

Fisher's Exact Test

    2 Verb Subcategorisation

    Precision and Recall

    3 Semantic Similarity

    Latent Semantic Indexing

    4 Register Analysis

    5 References


    References

Douglas Biber. Using Register-Diversified Corpora for General Language Studies. Computational Linguistics, 19(2):219-241, 1993.

Michael Brent. From grammar to lexicon: Unsupervised learning of lexical syntax. Computational Linguistics, 19(2):243-262, 1993.

Ted Dunning. Accurate Methods for the Statistics of Surprise and Coincidence. Computational Linguistics, 19(1):61-74, 1993.

Brigitte Krenn and Christer Samuelsson. The Linguist's Guide to Statistics: DON'T PANIC. URL http://coli.uni-sb.de/christer. Version of December 19, 1997.

Christopher Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge, MA, USA, 1999.

Ted Pedersen. Fishing for Exactness. In Proceedings of the South-Central SAS Users Group Conference, Austin, TX, USA, 1996.
