
Corpora and Statistical Methods Lecture 4

Albert Gatt
Corpora and Statistical Methods
Probability distributions (Part 2)

Example 1: Book publishing
Case:
- a publishing house considers whether to publish a new textbook on statistical NLP
- considerations include: production cost, expected sales, net profits (given cost)
Problem:
- to publish or not to publish? depends on expected sales and profits
- if published, how many copies? depends on demand and cost

Example 1: Demand & cost figures
Suppose the book costs 35, of which:
- the publisher gets 25
- the bookstore gets 6
- the author gets 4
To make a decision, the publisher needs to estimate profits as a function of the probability of selling n books, for different values of n:
profit = (25 * n) − overall production cost
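As a quick illustration of the profit formula, here is a minimal Python sketch; the function name and the example figures are illustrative, with the production cost taken from the table further below.

```python
def profit(n_copies_sold, production_cost, revenue_per_copy=25):
    """Publisher's profit: per-copy revenue times copies sold, minus production cost."""
    return revenue_per_copy * n_copies_sold - production_cost

# Selling 10,000 copies at a production cost of 300,000 yields a loss of 50,000.
print(profit(10_000, 300_000))  # -50000
```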

Terminology
Random variable
- In this example, the profit from selling n books is our random variable.
- It takes on different values, depending on n.
- We use uppercase (e.g. X) to denote the random variable.

Distribution
- The different values of X (denoted x) form a distribution.
- If each value x can be assigned a probability (the probability of making a given profit), then we can plot each value x against its likelihood.

Definitions
Random variable
- A variable whose numerical value is determined by chance. Formally, a function that returns a unique numerical value determined by the outcome of an uncertain situation.
- Can be discrete (our exclusive focus) or continuous.

Probability distribution
- For a discrete random variable X, the probability distribution p(x) gives the probability of each value x of X.
- The probabilities p(x) of all possible values of X sum to 1.
- The distribution tells us how much of the overall probability space (the probability mass) each value x takes up.

Tabulated probability distribution

No. copies sold | Prod. cost | Profit (X) | Probability p(x)
          5,000 |    275,000 |   -150,000 |              .20
         10,000 |    300,000 |    -50,000 |              .40
         20,000 |    350,000 |    150,000 |              .25
         30,000 |    400,000 |    350,000 |              .10
         40,000 |    450,000 |    550,000 |              .05

Plotting the distribution
[Figure: each profit value x plotted against its probability p(x)]
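The tabulated distribution can be written down directly as a small data structure. The sketch below (variable names are illustrative) checks that the probabilities sum to 1 and that each profit figure equals 25 * n minus the production cost.

```python
# Each row: (copies sold, production cost, profit x, probability p(x)).
distribution = [
    (5_000,  275_000, -150_000, 0.20),
    (10_000, 300_000,  -50_000, 0.40),
    (20_000, 350_000,  150_000, 0.25),
    (30_000, 400_000,  350_000, 0.10),
    (40_000, 450_000,  550_000, 0.05),
]

# The probabilities of all values of X sum to 1.
assert abs(sum(p for _, _, _, p in distribution) - 1.0) < 1e-9

# Each profit equals 25 * copies sold minus the production cost.
for n, cost, x, _ in distribution:
    assert x == 25 * n - cost
```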

Uses of a probability distribution
Computation of:
- mean: the expected value of X in the long run, based on the specific values of X and their probabilities. NB: not interpreted as a value in a sample of data, but as the expected (future) value based on the sample.
- standard deviation & variance: the extent to which actual values of X will differ from the mean
- skewness: the extent to which our distribution is balanced, i.e. whether it is symmetrical

In graphics
[Figure summarising the three measures:
- Mean: expected value in the long run
- SD & variance: how much actual values deviate from the mean overall
- Skewness: symmetry or tail of our distribution]

Measures of expectation and variation

The expected value (mean)
- The expected value of a discrete random variable X, denoted E[X] or μ, is a weighted average of the values of X.
- Weighted, because not all values x will have the same probability.
- Estimated by summing, for all values of X, the product of x and its probability p(x):
  E[X] = μ = Σ_x x p(x)

More on expected value
- The mean or expected value tells us that, in the long run, we can expect X to have the value μ.
- E.g. in our example, the book publisher can expect long-term profits of:
  (-150,000 * .2) + (-50,000 * .4) + (150,000 * .25) + (350,000 * .1) + (550,000 * .05) = 50,000

Variance
- The mean is the expected value of X, E[X].
- Variance (σ²) reflects the extent to which the actual outcomes deviate from expectation (i.e. from E[X]):
  σ² = E[(X − μ)²] = Σ_x (x − μ)² p(x)
  i.e. the weighted sum of squared deviations.
- Reasons for squaring:
  - eliminates the distinction between +ve and −ve deviations
  - gives larger deviations disproportionately more weight, e.g. one deviation of 10 counts as much as 4 deviations of 5

Standard deviation
- Variance gives overall dispersion or variation.
- Standard deviation (σ) is the dispersion of possible outcomes; it indicates how spread out the distribution is.
- Estimated as the square root of the variance: σ = √σ²
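A minimal sketch of these three quantities for a discrete distribution given as (value, probability) pairs; the helper names are illustrative, and the numbers reproduce the worked example above.

```python
# Profit values x and their probabilities p(x), from the table above.
outcomes = [(-150_000, 0.20), (-50_000, 0.40), (150_000, 0.25),
            (350_000, 0.10), (550_000, 0.05)]

def mean(dist):
    """Expected value E[X]: the sum of x * p(x) over all values x."""
    return sum(x * p for x, p in dist)

def variance(dist):
    """Variance: the weighted sum of squared deviations from the mean."""
    mu = mean(dist)
    return sum((x - mu) ** 2 * p for x, p in dist)

def std_dev(dist):
    """Standard deviation: the square root of the variance."""
    return variance(dist) ** 0.5

print(mean(outcomes))  # 50000.0, as in the worked example
# variance() and std_dev() can be applied to the same data for the
# exercise on the next slide.
```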

The book publishing example again
- Recall that for our new book on stat NLP, the expected profit is 50,000.
- What's the standard deviation?
  - estimate (x − 50,000)² for every value x
  - multiply by p(x) in each case
  - sum the results to obtain the variance
  - take the square root of that sum
- This is left as an exercise.

Skewness
- The mean gives us the centre of a distribution.
- The standard deviation gives us its dispersion.
- Skewness (denoted γ) is a measure of the symmetry of the outcomes.

Skewness, continued
The formula divides the average value of the cubed deviations by the standard deviation cubed:
γ = E[(X − μ)³] / σ³ = Σ_x (x − μ)³ p(x) / σ³

Why cubed?
- The cube of a positive deviation is positive; the cube of a negative deviation is negative. We want both, as we want to capture deviations both to the left (−ve) and to the right (+ve) of the mean.
- Like the variance estimation, cubing emphasises large deviations in either direction.
- If the outcomes are symmetrical around the mean, then +ve and −ve deviations balance out, and skewness is 0.
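A sketch of the skewness formula above for (value, probability) pairs; the symmetric distribution is made up purely to show that balanced deviations give γ = 0, and the second example reuses the book-publishing table.

```python
def skewness(dist):
    """Skewness: E[(X - mu)^3] / sigma^3 for a list of (value, probability) pairs."""
    mu = sum(x * p for x, p in dist)
    var = sum((x - mu) ** 2 * p for x, p in dist)
    third_moment = sum((x - mu) ** 3 * p for x, p in dist)
    return third_moment / var ** 1.5

# A symmetric distribution: left and right deviations balance, so skewness is 0.
symmetric = [(-10, 0.25), (0, 0.50), (10, 0.25)]
print(round(skewness(symmetric), 10))  # 0.0

# The book-publishing distribution from the earlier table has a long right tail.
profits = [(-150_000, 0.20), (-50_000, 0.40), (150_000, 0.25),
           (350_000, 0.10), (550_000, 0.05)]
print(skewness(profits) > 0)  # True: a tail going right gives positive skewness
```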

Graphical display of skewness
- Positive skewness: tail going right.

- Negative skewness: tail going left.

Skewness and language
- By Zipf's law (next week), word frequencies do not cluster around the mean.
- There are a few highly frequent words (making up a large proportion of overall word frequency).
- There are many highly infrequent words (f = 1 or f = 2).
- So the Zipfian distribution is highly skewed. We will hear more on the Zipfian distribution in the next lecture.

The concept of information
What is information?
Main ingredient: an information source, which transmits symbols from a finite alphabet S.
- Every symbol is denoted sᵢ.
- We call a sequence of such symbols a text.
- Assume a probability distribution s.t. every sᵢ has probability p(sᵢ).

Example:
- A die is an information source; every throw yields a symbol from the alphabet {1,2,3,4,5,6}.
- 6 successive throws yield a text of 6 symbols.

Quantifying information
Intuition:
- The more probable a symbol is, the less information it yields: something seen very often is not very surprising.
- So information is defined in terms of the inverse probability of the symbol:

I(s) = log_b (1/p(s)) = −log_b p(s)

for some b > 1. Usually we use base 2 (b = 2).
Another term for I(s) is surprisal.

Properties of I
- Non-negative: I(s) ≥ 0.
- If p(s) = 1, then I(s) = 0.
- If two events s₁, s₂ are independent, then: I(s₁s₂) = I(s₁) + I(s₂).
- Monotonic: slight changes in probability result in slight changes in I.
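A small sketch of the surprisal measure and the properties just listed; the function name is illustrative.

```python
from math import isclose, log2

def surprisal(p):
    """I(s) = -log2 p(s): the information carried by a symbol with probability p."""
    return -log2(p)

# Non-negative for any probability 0 < p <= 1.
assert all(surprisal(p) >= 0 for p in (0.01, 0.5, 1.0))

# A certain event carries no information: p(s) = 1 gives I(s) = 0.
assert surprisal(1.0) == 0.0

# For independent events, surprisals add: I(s1 s2) = I(s1) + I(s2).
p1, p2 = 0.5, 0.25
assert isclose(surprisal(p1 * p2), surprisal(p1) + surprisal(p2))
```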

Aggregate measure of information
- What is the information content of a text (sequence of symbols)?
- This is the same as finding the average information of a random variable.
- The measure is called entropy, denoted H.
- Define X as a random variable over the symbols in our alphabet: P(s) = P(X = s) for all s in the alphabet.
- Estimate H(P).

Entropy
The entropy (or information) of a probability distribution P is:
H(P) = −Σ_s p(s) log₂ p(s) = Σ_s p(s) I(s)

- Entropy is the expected value (mean) of the surprisal.
- The value is interpreted as a number of bits of information.

Entropy example
- Source: an 8-sided die.
- Alphabet S = {1,2,3,4,5,6,7,8}; every sᵢ has p(sᵢ) = 1/8.
- H(P) = −Σ_s (1/8) log₂ (1/8) = log₂ 8 = 3 bits.
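A minimal sketch computing the entropy of this distribution (the helper name is illustrative); it reproduces the value H(P) = 3 bits discussed on the next slides, and contrasts it with a more predictable, made-up source.

```python
from math import log2

def entropy(probabilities):
    """H(P) = -sum over s of p(s) * log2 p(s), in bits."""
    return -sum(p * log2(p) for p in probabilities if p > 0)

# An 8-sided die: every symbol in {1,...,8} has probability 1/8.
die = [1 / 8] * 8
print(entropy(die))  # 3.0 bits

# A biased source is more predictable, so its entropy is lower.
biased = [0.5, 0.25, 0.125, 0.125]
print(entropy(biased))  # 1.75 bits
```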

Interpretation of entropy
- The information contained in the distribution P (the more unpredictable the outcomes, the higher the entropy).

- The message length if the message was generated according to P and coded optimally.

Interpretation cont'd
- For the 8-sided die example, the result H(P) = 3 tells us we need 3 bits on average to transmit the result of rolling an 8-sided die:

- We can't do it in less than 3 bits.

Symbol:  1    2    3    4    5    6    7    8
Code:   001  010  011  100  101  110  111  000

Entropy for multiple variables
- So far we have dealt with a single random variable.
- The joint entropy of a pair of random variables X, Y is:
  H(X,Y) = −Σ_x Σ_y p(x,y) log₂ p(x,y)
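A sketch of joint entropy over a joint distribution stored as a dictionary mapping (x, y) pairs to probabilities; the toy distribution is made up for illustration.

```python
from math import log2

def joint_entropy(joint):
    """H(X,Y) = -sum over (x, y) of p(x, y) * log2 p(x, y)."""
    return -sum(p * log2(p) for p in joint.values() if p > 0)

# A made-up joint distribution over two binary variables.
p_xy = {('a', 'c'): 0.25, ('a', 'd'): 0.25,
        ('b', 'c'): 0.25, ('b', 'd'): 0.25}
print(joint_entropy(p_xy))  # 2.0 bits: equivalent to two independent fair coin flips
```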

Conditional entropy
- Given X and Y, how much uncertainty about Y remains once we know X?
- A version of entropy using conditional probability, H(Y|X):
  H(Y|X) = −Σ_x Σ_y p(x,y) log₂ p(y|x) = H(X,Y) − H(X)
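A sketch of conditional entropy using the identity H(Y|X) = H(X,Y) − H(X); the joint distribution is a toy example chosen so that knowing X removes some, but not all, of the uncertainty about Y.

```python
from math import log2

def entropy(probs):
    """H = -sum of p * log2 p over a collection of probabilities."""
    return -sum(p * log2(p) for p in probs if p > 0)

def conditional_entropy(joint):
    """H(Y|X) = H(X,Y) - H(X), for a dict mapping (x, y) -> p(x, y)."""
    h_xy = entropy(joint.values())
    # Marginal distribution of X: sum p(x, y) over all y.
    p_x = {}
    for (x, _), p in joint.items():
        p_x[x] = p_x.get(x, 0.0) + p
    return h_xy - entropy(p_x.values())

# Toy joint distribution over X in {x1, x2} and Y in {y1, y2}.
p_xy = {('x1', 'y1'): 0.4, ('x1', 'y2'): 0.1,
        ('x2', 'y1'): 0.1, ('x2', 'y2'): 0.4}
print(conditional_entropy(p_xy))  # ~0.72 bits, versus H(Y) = 1 bit on its own
```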

Mutual information
- Just as probability can change based on posterior knowledge, so can information.
- Suppose our distribution gives us the probability P(a) of observing the symbol a.
- Suppose we first observe the symbol b. If a and b are not independent, this should alter our information state with respect to the probability of observing a, i.e. we can compute p(a|b).

Mutual info between two symbols
The change in our information about a on observing b is:
I(a;b) = log₂ [ p(a|b) / p(a) ] = log₂ [ p(a,b) / (p(a) p(b)) ]

- If a and b are completely independent, then p(a|b) = p(a) and I(a;b) = 0.

Averaging mutual information
We want to average mutual information between all values of a random variable A and those of a random variable B:
I(A;B) = Σ_a Σ_b p(a,b) log₂ [ p(a|b) / p(a) ]

And similarly:
I(A;B) = Σ_a Σ_b p(a,b) log₂ [ p(b|a) / p(b) ]

Combining the two:
I(A;B) = Σ_a Σ_b p(a,b) log₂ [ p(a,b) / (p(a) p(b)) ]

- Thus, mutual info involves taking the joint probability and dividing by the individual probabilities, i.e. a comparison of the likelihood of observing a and b together vs. separately (see the sketch after the summary below).

Mutual Information: summary
- Gives a measure of the reduction in uncertainty about a random variable X, given knowledge of Y: I(X;Y) = H(X) − H(X|Y).

- Quantifies how much information about X is contained in Y.
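A sketch of the average measure derived above, summing p(x,y) log₂[ p(x,y) / (p(x)p(y)) ] over all pairs; both joint distributions are toy examples.

```python
from math import log2

def mutual_information(joint):
    """I(X;Y) = sum over (x, y) of p(x, y) * log2( p(x, y) / (p(x) * p(y)) )."""
    p_x, p_y = {}, {}
    for (x, y), p in joint.items():
        p_x[x] = p_x.get(x, 0.0) + p
        p_y[y] = p_y.get(y, 0.0) + p
    return sum(p * log2(p / (p_x[x] * p_y[y]))
               for (x, y), p in joint.items() if p > 0)

# Dependent variables: knowing X reduces our uncertainty about Y, so I(X;Y) > 0.
dependent = {('x1', 'y1'): 0.4, ('x1', 'y2'): 0.1,
             ('x2', 'y1'): 0.1, ('x2', 'y2'): 0.4}
print(mutual_information(dependent))    # ~0.28 bits

# Independent variables: p(x,y) = p(x)p(y) everywhere, so I(X;Y) = 0.
independent = {('x1', 'y1'): 0.25, ('x1', 'y2'): 0.25,
               ('x2', 'y1'): 0.25, ('x2', 'y2'): 0.25}
print(mutual_information(independent))  # 0.0
```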

Some more on I(X;Y)
- In statistical NLP, we often calculate pointwise mutual information: the mutual information between two points (values) of a distribution, i.e. I(x;y) rather than I(X;Y).
- Used for some applications in lexical acquisition.

Mutual Information: example
- Suppose we're interested in the collocational strength of two words x and y, e.g. bread and butter.
- Mutual information quantifies the likelihood of observing x and y together (in some window) relative to observing them separately.
- If there is no interesting relationship, knowing about bread tells us nothing about the likelihood of encountering butter. Here, P(x,y) = P(x)P(y) and I(x;y) = 0.
- This is the Church and Hanks (1991) approach. NB: the approach uses pointwise MI.
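A hedged sketch of the pointwise MI computation for a word pair, with probabilities estimated from raw corpus counts; the counts, corpus size, window convention, and function name are all made up for illustration and are not taken from Church and Hanks.

```python
from math import log2

def pmi(count_xy, count_x, count_y, n_words):
    """Pointwise MI: log2( P(x,y) / (P(x) * P(y)) ), with probabilities
    estimated as relative frequencies from corpus counts."""
    p_xy = count_xy / n_words
    p_x = count_x / n_words
    p_y = count_y / n_words
    return log2(p_xy / (p_x * p_y))

# Hypothetical counts from a 1-million-word corpus: 'bread' occurs 500 times,
# 'butter' 400 times, and they co-occur within the chosen window 40 times.
print(pmi(40, 500, 400, 1_000_000))  # ~7.6: far more co-occurrence than chance

# If the pair co-occurred only as often as chance predicts (p(x,y) = p(x)p(y),
# about 0.2 co-occurrences in this corpus), the PMI would be 0.
```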