INFORMATION THEORY POLYNESIAN REVISITED Thomas Tiahrt, MA, PhD CSC492 – Advanced Text Analytics.

Information Theory
Polynesian Revisited
Thomas Tiahrt, MA, PhD
CSC492 Advanced Text Analytics

Hello and welcome to CSC 492 Advanced Text Analytics. We continue our overview of information theory by revisiting Simplified Polynesian.

Models vs. Reality

Simplified Polynesian:
- Not actually a random variable
- But can be modeled by a random variable

When we use statistics to represent a phenomenon, we always keep in mind that we are simplifying the phenomenon in order to work with it. We create a model, but the model is not reality. Recall George E. P. Box's adage that all models are wrong, but some are useful. We want to work with useful models.

In our earlier session, we approximated the Simplified Polynesian language by assuming that we can model it as a random variable. It won't be a complete representation of reality, but it should be good enough for the purposes we want to pursue.

Models vs. Reality

Suppose that we are provided with new information. Linguists living among Polynesians have discovered that Simplified Polynesian has a syllable structure, and that a syllable is always a consonant followed by a vowel. This new information allows us to construct a better model using syllables than we had using letters alone.

Polynesian Syllable Model

Given that we know that all syllables are consonant-vowel sequences, we can model the language with two random variables, one for the consonant and one for the vowel of each syllable. With our new model we have a joint distribution and marginal distributions.

Joint and Marginal Distributions

In the upper table we have the joint distribution at the intersection of each letter pair, and the marginal distributions in the margins. The bottom table compares the per-letter probabilities to the per-syllable probabilities, noting that the per-syllable probabilities are marginal probabilities. Because the marginal probabilities are on a per-syllable basis, they are double the per-letter probabilities. We must keep that doubling factor in mind when we get to our model-to-model comparison.
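
The tables themselves are not reproduced in this transcript. For reference, the joint distribution of consonant-vowel pairs in the Simplified Polynesian example of Manning and Schütze (the first source cited at the end; these values are consistent with the calculations that follow) is:

             p       t       k     | P(V = v)
    a       1/16    3/8     1/16   |   1/2
    i       1/16    3/16     0     |   1/4
    u        0      3/16    1/16   |   1/4
    --------------------------------
  P(C = c)  1/8     3/4     1/8

The row and column sums in the margins give the marginal distributions of the vowel and consonant random variables.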

Joint Entropy

Recall that we derived equation 15 in our last session. Now we want to use that result.

Joint Entropy

Because equation 15 is applicable to our new model of Polynesian, we just need to substitute our consonant and vowel notation for the generic S and T notation. We will use that previous result in our entropy calculation.
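
Equation 15 itself is not shown in the transcript. Assuming it is the chain rule for joint entropy, H(S, T) = H(S) + H(T \mid S), the substitution described here gives

  H(C, V) = H(C) + H(V \mid C)

so the joint entropy of a syllable is the sum of the two values computed on the next slides.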

The first of our two values is the entropy of the probabilities of the consonants. We use the marginal probabilities to compute the entropy.

On this slide we are just finishing the calculation we began on the previous slide.
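
Using the consonant marginals assumed above (1/8, 3/4, and 1/8 for p, t, and k), the calculation works out as

  H(C) = -\sum_{c \in C} p(c) \log_2 p(c)
       = -(1/8) \log_2(1/8) - (3/4) \log_2(3/4) - (1/8) \log_2(1/8)
       = 3/8 + (3/4) \log_2(4/3) + 3/8
       \approx 0.375 + 0.311 + 0.375 = 1.061 bits.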

For the second component of our syllable model we need the entropy of the conditional probability of the vowels given the consonants.

Equation 16 is just equation 13 with the consonant and vowel set identifiers. We want to verify where all the numbers that go into our calculation come from, so we will compute the components of equation 16 separately to make the computation easier to follow. The table will serve as a handy reference to the probabilities we need.
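
Equation 16 is not reproduced in the transcript. Assuming it is the usual conditional entropy written with the consonant set C and the vowel set V, it reads

  H(V \mid C) = -\sum_{c \in C} \sum_{v \in V} p(c, v) \log_2 p(v \mid c).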

Before we can compute the log of the vowel probabilities given the consonants, we need to have those conditional probabilities. We show the summations here, but of course they are just the marginal probabilities of each consonant.
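
With the joint probabilities assumed above, those sums are

  p(p) = 1/16 + 1/16 + 0 = 1/8
  p(t) = 3/8 + 3/16 + 3/16 = 3/4
  p(k) = 1/16 + 0 + 1/16 = 1/8.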

Next we use those marginal probabilities as the denominators to calculate each of the vowel probabilities given the consonants. We use the nine probabilities from the Cartesian product of the consonant set C and the vowel set V to compute the conditional probabilities of the vowels given the consonants, and we place those values in the table for easy reference in our next step.
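
With the joint values assumed above, the nine conditional probabilities p(v \mid c) = p(c, v) / p(c) come out as:

            a       i       u
    p      1/2     1/2      0
    t      1/2     1/4     1/4
    k      1/2      0      1/2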

Next we simply take the log of each value. Note that this is a base-2 logarithm, even though the 2 is not shown with the log operator here.

We perform the multiplication and add up the results to obtain the second of our two components of the entropy of our Simplified Polynesian syllable model.
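
With the assumed values, each term is p(c, v) multiplied by -\log_2 p(v \mid c), and the two zero-probability cells contribute nothing:

  H(V \mid C) = 1/16(1) + 1/16(1) + 3/8(1) + 3/16(2) + 3/16(2) + 1/16(1) + 1/16(1)
              = 1/4 + 9/8
              = 1.375 bits.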

Joint Entropy

At last we make the final addition for the entropy calculation. We find that the entropy is 2.43625 bits per syllable. To compare that to our per-letter model we must multiply the per-letter entropy by two, because we have two-letter syllables. The reason for the reduction in entropy is that our new model reduces uncertainty. The reduction in uncertainty means that, on average, we are less surprised by Polynesian than we were before, when we used the per-letter model.
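
As a cross-check of the arithmetic, here is a short Python sketch. The joint probabilities are the Manning and Schütze values assumed above, and the 2.5 bits of per-letter entropy is the result from the earlier session.

from math import log2

# Joint distribution p(consonant, vowel), assumed from Manning and Schütze's
# Simplified Polynesian example; the slide's table is not in the transcript.
joint = {
    ('p', 'a'): 1/16, ('p', 'i'): 1/16, ('p', 'u'): 0,
    ('t', 'a'): 3/8,  ('t', 'i'): 3/16, ('t', 'u'): 3/16,
    ('k', 'a'): 1/16, ('k', 'i'): 0,    ('k', 'u'): 1/16,
}

# Consonant marginals: p(c) is the sum over vowels of p(c, v).
p_c = {}
for (c, v), p in joint.items():
    p_c[c] = p_c.get(c, 0) + p

# H(C): entropy of the consonant marginal distribution.
h_c = -sum(p * log2(p) for p in p_c.values() if p > 0)

# H(V|C): conditional entropy, summing p(c, v) * log2 p(v|c) over nonzero cells.
h_v_given_c = -sum(p * log2(p / p_c[c]) for (c, v), p in joint.items() if p > 0)

print(round(h_c, 5))                # about 1.061
print(round(h_v_given_c, 5))        # 1.375
print(round(h_c + h_v_given_c, 5))  # about 2.436 bits per syllable
print(2 * 2.5)                      # per-letter model: 5.0 bits for a two-letter syllable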

Entropy Rate

The amount of information conveyed in a message depends on the length of the message. A longer message will, on average, carry more information than a shorter one. Consequently we want to use a per-unit value, where the units may be letters or may be words. Here the 1n subscript refers to the fact that this is a per-unit measure.
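
Assuming the slide follows the notation of Manning and Schütze, where X_{1n} denotes the sequence X_1, ..., X_n, the per-unit measure is the entropy rate

  H_{rate} = \frac{1}{n} H(X_{1n}) = -\frac{1}{n} \sum_{x_{1n}} p(x_{1n}) \log_2 p(x_{1n}),

and the entropy of a language L, discussed on the next slide, is the limit of this rate as the sample grows:

  H(L) = \lim_{n \to \infty} \frac{1}{n} H(X_1, X_2, \ldots, X_n).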

Entropy of Human Language

We assume that human language is a stochastic process consisting of a token sequence. Imagine that we have a Web crawler that continually collects new samples of a language. As we collect more and more data, the entropy of the language approaches a limit, which we use as our entropy estimate for the language.

References

Sources:
Foundations of Statistical Natural Language Processing, by Christopher Manning and Hinrich Schütze. The MIT Press.
Fundamentals of Information Theory and Coding Design, by Roberto Togneri and Christopher J.S. deSilva. Chapman & Hall / CRC.

The end of the Conditional Entropy slide show has come.

End of the Slides

This ends our Joint and Conditional Entropy slide sequence.