INFORMATION THEORY
SIMPLIFIED POLYNESIAN LANGUAGE EXAMPLE
Thomas Tiahrt, MA, PhD
CSC492 – Advanced Text Analytics


Hello and Welcome to CSC 492 Advanced Text Analytics. We continue our overview of information theory with a discussion of simplified Polynesian.

Simplified Polynesian Example

letter       p    t    k    a    i    u
probability  1/8  1/4  1/8  1/4  1/8  1/8

We begin by looking at an artificial language derived from Polynesian. There are about forty different Polynesian languages, including Tahitian, Samoan and Hawaiian. We won't be examining Polynesian as it actually exists, but instead will use this simplified version as a way to learn about entropy.

Polynesian languages are remarkable for their small alphabets. Here we are assuming that there are just six letters. Four of the letters, p, k, i and u, have a one out of eight frequency, and two letters, t and a, have a one out of four frequency. Note that the vowels as a group and the consonants as a group have equal probability. The frequency distribution here is a per-letter distribution.
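As a quick sketch, we can check in Python that the distribution sums to one and that the vowels and the consonants each carry probability one half:

```python
from fractions import Fraction

# Per-letter distribution from the slide.
probs = {'p': Fraction(1, 8), 't': Fraction(1, 4), 'k': Fraction(1, 8),
         'a': Fraction(1, 4), 'i': Fraction(1, 8), 'u': Fraction(1, 8)}

vowels = {'a', 'i', 'u'}
p_vowel = sum(p for ltr, p in probs.items() if ltr in vowels)
p_consonant = sum(p for ltr, p in probs.items() if ltr not in vowels)
print(p_vowel, p_consonant, p_vowel + p_consonant)  # 1/2 1/2 1
```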

To calculate the entropy we place the probabilities into our formula. After doing the math we find that the per-letter entropy is two and one half.

letter       p    t    k    a    i    u
probability  1/8  1/4  1/8  1/4  1/8  1/8
code         100  00   101  01   110  111
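The arithmetic can be checked directly; a minimal sketch computing H = −Σ p(x) log₂ p(x) over the six letters:

```python
from math import log2

# Per-letter probabilities from the slide.
probs = {'p': 1/8, 't': 1/4, 'k': 1/8, 'a': 1/4, 'i': 1/8, 'u': 1/8}

# Entropy: H = -sum over letters of p * log2(p).
entropy = -sum(p * log2(p) for p in probs.values())
print(entropy)  # 2.5 bits per letter
```

The four 1/8-probability letters each contribute 3/8 of a bit and the two 1/4-probability letters each contribute 1/2 of a bit, giving 1.5 + 1.0 = 2.5.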

The per-letter entropy of two and one half is confirmed by our ability to design a binary code that takes, on average, two and one half bits to transmit a letter. The code uses fewer bits to send letters that occur more frequently, but it must also be capable of being unambiguously decoded. One way to decode a message is to use the fact that a codeword beginning with a zero is two bits long, and a codeword beginning with a one is three bits long.
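A minimal sketch of this decoding rule, using the codewords from the table (the helper function `decode` is ours, for illustration):

```python
# Codewords from the slide: a leading 0 means a two-bit codeword,
# a leading 1 means a three-bit codeword.
code = {'p': '100', 't': '00', 'k': '101', 'a': '01', 'i': '110', 'u': '111'}

def decode(bits):
    """Decode a bit string: leading 0 -> read 2 bits, leading 1 -> read 3 bits."""
    inv = {cw: ltr for ltr, cw in code.items()}
    letters = []
    i = 0
    while i < len(bits):
        width = 2 if bits[i] == '0' else 3
        letters.append(inv[bits[i:i + width]])
        i += width
    return ''.join(letters)

msg = 'kaui'
encoded = ''.join(code[c] for c in msg)
print(encoded)          # 10101111110
print(decode(encoded))  # kaui
```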

A great deal of work in information theory has gone into creating such codes. It is called source coding, and its purpose is to find a code that makes storing and transmitting information efficient: it converts a source message into a binary code word that can be decoded without losing any information. Because we are interested in decoding information we do not need to explore source coding any further, except to note the ambiguity between source code, the programming-language statements we write, and a source code in information theory, an efficient and uniquely decodable pattern for transmitting and storing information. The only way to disambiguate the two senses of the phrase is context.
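Huffman coding is the classic source-coding construction for a known distribution; the lecture does not name it, so the following is a standard-algorithm sketch rather than the lecture's method. Applied to the six-letter distribution it reproduces the codeword lengths above (the exact bit patterns may differ from the slide's, depending on tie-breaking, but the lengths and the average are the same):

```python
import heapq
from fractions import Fraction

probs = {'p': Fraction(1, 8), 't': Fraction(1, 4), 'k': Fraction(1, 8),
         'a': Fraction(1, 4), 'i': Fraction(1, 8), 'u': Fraction(1, 8)}

# Build a Huffman code: repeatedly merge the two least probable subtrees.
# The integer counter breaks ties so tuples never compare the dicts.
heap = [(p, i, {letter: ''}) for i, (letter, p) in enumerate(probs.items())]
heapq.heapify(heap)
counter = len(heap)
while len(heap) > 1:
    p1, _, left = heapq.heappop(heap)
    p2, _, right = heapq.heappop(heap)
    merged = {ltr: '0' + cw for ltr, cw in left.items()}
    merged.update({ltr: '1' + cw for ltr, cw in right.items()})
    heapq.heappush(heap, (p1 + p2, counter, merged))
    counter += 1
code = heap[0][2]

# Lengths match the slide's code: 2 bits for t and a, 3 bits for the rest.
lengths = {ltr: len(cw) for ltr, cw in code.items()}
avg = sum(probs[ltr] * lengths[ltr] for ltr in probs)
print(avg)  # 5/2
```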

Twenty Questions

Another way to think about entropy is to liken it to the game of Twenty Questions. In the game, one or more players are inquisitors. The inquisitors ask another player, the keeper of the answer, a sequence of yes/no questions about an entity the keeper has thought of beforehand. That is, the keeper can only answer yes or no. The answers can then be thought of as binary responses, which we could encode using zeros and ones.
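The keeper's yes/no answers can be read as bits, so one hypothetical question strategy (our illustration, not prescribed by the lecture) is simply to follow the binary code for the letters, one bit per question; the expected number of questions then equals the entropy:

```python
# Follow the binary code bit by bit: each bit of a letter's codeword
# is the answer to one yes/no question.
code = {'p': '100', 't': '00', 'k': '101', 'a': '01', 'i': '110', 'u': '111'}
probs = {'p': 1/8, 't': 1/4, 'k': 1/8, 'a': 1/4, 'i': 1/8, 'u': 1/8}

# Questions needed per letter = codeword length, so the expectation
# is the expected code length, which is the entropy.
expected_questions = sum(probs[ltr] * len(cw) for ltr, cw in code.items())
print(expected_questions)  # 2.5
```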

You can imagine playing the same game here to discover the letters in a message: begin by asking about consonants versus vowels, then ask about the higher-probability letters before working through the lower-probability ones. If you ask good questions, you should need at most three questions, and on average two and one half questions, to discover each letter. That is exactly what entropy is measuring here: the entropy measures the size of the search space of possible values of a random variable, weighted by the probabilities of those values.

References

Foundations of Statistical Natural Language Processing, by Christopher Manning and Hinrich Schütze, The MIT Press.
Fundamentals of Information Theory and Coding Design, by Roberto Togneri and Christopher J.S. deSilva, Chapman & Hall / CRC.

End of Part 1

This ends our initial discussion of simplified Polynesian. We will return to simplified Polynesian later.