Investigating the Ancient Meroitic Language Using Statistical Natural Language Techniques: Zipf’s Law and Word Co-Occurrences Reginald Smith August 10, 2006 Sudan Studies Association Conference Rhode Island College

Page 1

Investigating the Ancient Meroitic Language Using Statistical Natural Language Techniques:

Zipf’s Law and Word Co-Occurrences

Reginald Smith

August 10, 2006

Sudan Studies Association Conference

Rhode Island College

Page 2

Meroitic is the language of the ancient kingdom of Kush

• Used for almost six hundred years from 2nd century BCE to 4th century CE

• Phonetic script written right to left (like Arabic)

• Transliteration made possible by the work of British archaeologist F. Ll. Griffith around 1910

Page 3

Meroitic remains largely undeciphered and an enigma

• No complete vocabulary is available

• Some words such as place names, loan words, or simple concepts are known
– For example, “qore” means king
– Perhaps “qes” is Kush

• Many attempts have been made to understand Meroitic using phonology or comparative linguistics
– Scholars have tried in vain to find a known language that is a relative (see sources in paper)
– We wish we had a bilingual text like the Rosetta Stone to guide us

Page 4

A new method could use mathematics and linguistics

• Statistical natural language processing analyzes the properties of language using a mix of statistics and linguistics

• Several statistical properties are shared by all human languages

• Certain techniques may also help us infer the meanings of words (by relating them to other known words)

Page 5

Zipf’s Law: Frequencies of Words

• If you rank the words in a text by frequency (the number of times each word appears, with #1 being the most frequent) and then relate each word’s rank to its frequency, you get Zipf’s Law

• Zipf’s Law: $F = C R^{-\alpha}$ (1), where F is the frequency of a word, C is a constant, R is the rank, and α is known as the power law exponent

• For natural languages in general, α ≈ 1
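As a short worked step, taking logarithms of the power law shows why a log-log plot is the natural way to check it. The form with a negative exponent is assumed here because it is the one consistent with α ≈ 1 and with the negative slopes reported later in these slides.

```latex
\begin{align}
F &= C R^{-\alpha} \\
\log F &= \log C - \alpha \log R
\end{align}
```

Under that assumption, log F is a linear function of log R with slope -α and intercept log C.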

Page 6

Zipf Law Graphs

• When you graph the frequency vs. the rank on a log-log graph (graphing the logarithm of frequency vs. the logarithm of rank) you get a straight line whose slope is -α

Picture Source: University of Helsinki CS department

Zipf line fit on data. The red line is the fitted slope on the data points
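A line fit of this kind can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `zipf_slope`, the ordinary least-squares fit, and the toy corpus are all hypothetical choices, not the procedure or data used for the plots in this talk.

```python
import math
from collections import Counter

def zipf_slope(tokens):
    """Least-squares slope of log(frequency) vs. log(rank).

    Assumes the text contains at least two distinct words.
    """
    # Frequencies sorted from most to least frequent; position + 1 is the rank
    counts = sorted(Counter(tokens).values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(freq) for freq in counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Toy corpus (placeholder words): the word of rank R appears 120 / R times
toy_tokens = []
for rank, word in enumerate(["w1", "w2", "w3", "w4", "w5", "w6"], start=1):
    toy_tokens.extend([word] * (120 // rank))

slope = zipf_slope(toy_tokens)
print(f"fitted slope = {slope:.2f}")
```

Because the toy frequencies were built to fall off as 1/R, the fitted slope should come out very close to -1. Real texts are noisier, especially in the low-frequency tail.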

Page 7

Does Meroitic follow Zipf’s Law?

• The two graphs below show log-log plots of frequency vs. rank for the Meroitic words in 69 texts. The slopes are shown for each
– The normal plot counts the words as is. The morpheme out plot splits out suffixes like –lowi as the separate words “lo” and “wi”
– Since it has a slope of nearly -1, the morpheme out model of Meroitic seems to follow Zipf’s Law

Normal plot Slope = -0.81

Morpheme out plot Slope = -1.03
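The difference between the two counting schemes can be sketched as follows. Only the split of –lowi into “lo” and “wi” comes from the slides; the suffix table, the function names, and the sample tokens are hypothetical placeholders, not real Meroitic data or the actual segmentation rules used.

```python
from collections import Counter

# Hypothetical suffix table: only the -lowi -> "lo" + "wi" split is from the slides
SUFFIX_SPLITS = {"lowi": ["lo", "wi"]}

def split_morphemes(word):
    """Return the stem followed by its split-out suffix morphemes."""
    for suffix, parts in SUFFIX_SPLITS.items():
        # Only split when something is left over to serve as the stem
        if word.endswith(suffix) and len(word) > len(suffix):
            return [word[: -len(suffix)]] + parts
    return [word]

def count_tokens(words, morpheme_out=False):
    """Count words as is ("normal") or with suffixes split out ("morpheme out")."""
    tokens = []
    for w in words:
        tokens.extend(split_morphemes(w) if morpheme_out else [w])
    return Counter(tokens)

# Placeholder tokens, not real Meroitic words
sample = ["aaalowi", "bbblowi", "aaa", "ccc"]
normal = count_tokens(sample)
morph = count_tokens(sample, morpheme_out=True)
print(normal)
print(morph)
```

In the normal count every sample token is distinct, while in the morpheme out count the shared pieces “lo”, “wi”, and the stem “aaa” each appear twice. Splitting therefore changes the rank-frequency curve, which is why the two plots have different slopes.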

Page 8

So what does this show us (besides graphs)?

• Despite the apparently small number of texts available, our sample of Meroitic shows the same rank-frequency structure as other human languages (English, Chinese, etc.)

• Therefore, even though we don’t know the meaning of the words, the sample of the language we have appears to be representative
– Even though most of our samples are redundant funeral stelae

• We can then proceed to use other statistical techniques on Meroitic and also compare its statistical features to other languages

Page 9

Step Two: Word Co-occurrence

• When words occur together in a text, they are said to co-occur
– “I am here” has co-occurrence between “I-am” and “am-here”

• Co-occurrences can tell us about the words if we have enough of them
– Words that co-occur with the same words often have similar parts of speech or even meanings
– Can we use word co-occurrence in Meroitic to analyze classes of words?
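The “I am here” example can be reproduced with a minimal sketch that treats co-occurrence as adjacency between neighboring words. The function names are hypothetical, and splitting on whitespace is an assumption made for the English example only.

```python
from collections import Counter

def adjacent_pairs(sentence):
    """Return the co-occurring (left, right) pairs of neighboring words."""
    words = sentence.split()
    return list(zip(words, words[1:]))

def cooccurrence_counts(sentences):
    """Count how often each adjacent pair appears across all sentences."""
    counts = Counter()
    for s in sentences:
        counts.update(adjacent_pairs(s))
    return counts

pairs = adjacent_pairs("I am here")
print(pairs)
```

For “I am here” this yields the two pairs named above, “I-am” and “am-here”. With a larger corpus, `cooccurrence_counts` would show which pairs recur often enough to be informative.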

Page 10

What I did with Meroitic

• I analyzed Meroitic by matching together words that co-occurred with the same types of words

• For example, if you have two sentences: “I eat horses” and “We eat lizards”
– I match “I” and “We” because they both co-occur with “eat”
– I also match “horses” and “lizards” because they also co-occur with “eat” (in the opposite direction*)

• I then graph connected words together and analyze them with software
– What happens?

*Technical note: I actually used undirected edges for co-occurring words in the graph shown on the next page
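The matching step in the “eat” example can be sketched as below. This is an illustration only: it keeps track of direction (which word comes before and which comes after) so that it reproduces the two matches described in the example, whereas the technical note says the actual graph used undirected edges. The function name `match_words` and the matching rule (at least one shared neighbor on the same side) are assumptions.

```python
from collections import defaultdict
from itertools import combinations

def match_words(sentences):
    """Pair words that share a following word or share a preceding word."""
    followers = defaultdict(set)   # word -> words that come right after it
    preceders = defaultdict(set)   # word -> words that come right before it
    for s in sentences:
        words = s.split()
        for left, right in zip(words, words[1:]):
            followers[left].add(right)
            preceders[right].add(left)
    vocab = sorted(set(followers) | set(preceders))
    matches = set()
    for a, b in combinations(vocab, 2):
        # Match if both words appear before the same word, or after the same word
        if followers[a] & followers[b] or preceders[a] & preceders[b]:
            matches.add((a, b))
    return matches

matches = match_words(["I eat horses", "We eat lizards"])
print(sorted(matches))
```

Here “I” and “We” are matched because both precede “eat”, and “horses” and “lizards” are matched because both follow it. Dropping the direction, as in an undirected graph, would also link words on opposite sides of “eat”, so the choice of edge type affects which groups form.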

Page 11

Meroitic Words Graph

• Four main groups of words form that correspond well to Meroitic categories including positions and titles, verbs, places, and miscellaneous nouns

[Figure: graph of connected Meroitic words, with clusters labeled Group 1, Group 2, Group 3, and Group 4]

Page 12

Results

• Techniques like word co-occurrence matching can help us categorize Meroitic words whose part of speech we previously had to guess, by mapping them against words whose part of speech we already know

• Similar statistical techniques may allow us to match words with a similar “meaning” to infer the meanings of some words
– This is still speculative though

Page 13

Conclusion

• Statistical natural language processing is a new approach to Meroitic that could supplement other current efforts in the language

• Much more work remains to be done, but this new avenue may help us move closer to the goal of understanding this beautiful and mysterious language

• Acknowledgements: I give my boundless appreciation to Dr. Richard Lobban and Dr. Laurance Doyle for the help and advice they gave me on this paper’s topics