A million monkeys and Shakespeare

December 2011, p. 190. © 2011 The Royal Statistical Society

“Who’s there?” “Nay, answer me. Stand and unfold yourself”…

And so Hamlet begins. As is well known, if a monkey is placed at a typewriter and hits the keys at random, independently, he or she will produce a word-for-word reproduction of the text of Hamlet – eventually. The “eventually” is important. Our monkey has one chance in 26 of hitting W as his first letter; one in 26×26 (= 1 in 676) of hitting W as the first and h as the second letter; 1 in $26^{3}$ of getting the entire word “Who”; and, ignoring things like punctuation, spaces, capitals and apostrophes, one in 26 to the power of 42 of getting the 42 letters in that first line of Hamlet right first time – which is, roughly, 1 chance in $3 \times 10^{59}$.

There are about 130 000 letters in the whole of Hamlet. The chance of randomly typing out the entire play correctly at first trial therefore works out at one in $26^{130\,000}$, which is one in $3.4 \times 10^{183\,946}$. A single monkey would need that kind of number of attempts before there was a reasonable chance of his or her getting it right. This would take a very long time. In fact it is a time that is many orders of magnitude longer than the life of the universe (let alone the life of the monkey). $10^{183\,946}$ is a number that is mind-bogglingly big. As a comparison, there are only about $10^{80}$ particles in the observable universe.
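The arithmetic above can be checked with logarithms, since numbers this large are awkward to handle directly. The following is a minimal sketch, assuming a 26-letter alphabet with no spaces or punctuation as in the text; the function name is illustrative only.

```python
import math

ALPHABET_SIZE = 26  # letters only: no spaces, punctuation or capitals


def odds_as_power_of_ten(num_letters):
    """Return (mantissa, exponent) with 26**num_letters ~= mantissa * 10**exponent."""
    log10_value = num_letters * math.log10(ALPHABET_SIZE)
    exponent = math.floor(log10_value)
    mantissa = 10 ** (log10_value - exponent)
    return mantissa, exponent


if __name__ == "__main__":
    for label, n in [("first line of Hamlet", 42), ("whole of Hamlet", 130_000)]:
        mantissa, exponent = odds_as_power_of_ten(n)
        print(f"{label}: 1 chance in about {mantissa:.1f} x 10^{exponent}")
```

For 42 letters this gives about 2.7 × 10^59, which the text rounds to 3 × 10^59; for 130 000 letters it gives about 3.4 × 10^183946.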

Even using more than one monkey does not help much. If we took the same number of monkeys as there are particles in the universe ($10^{80}$), and each typed 1000 keystrokes per second for 100 times the life of the universe (which is $10^{20}$ seconds), we would still find the probability of the monkeys replicating even a short book to be impossibly small.
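A rough worked bound shows why the extra monkeys do not help. It is a sketch under stated assumptions: every keystroke is treated as a possible starting point for the text, each start matches an $n$-letter text with probability $26^{-n}$, and the expected number of complete copies is bounded by linearity of expectation. The symbols $N$ and $n$ are introduced here for illustration, and the block assumes the amsmath and amssymb packages.

```latex
\begin{align*}
N &= \underbrace{10^{80}}_{\text{monkeys}}
     \times \underbrace{10^{3}\ \text{s}^{-1}}_{\text{keystrokes per second}}
     \times \underbrace{10^{20}\ \text{s}}_{\text{typing time}}
   = 10^{103}\ \text{keystrokes}, \\
\mathbb{E}[\text{copies of an } n\text{-letter text}]
  &\le \frac{N}{26^{n}}
   = 10^{\,103 - n \log_{10} 26}
   \approx 10^{\,103 - 1.415\,n}, \\
n = 130\,000\ (\textit{Hamlet})
  &\;\Rightarrow\; \mathbb{E}[\text{copies}] \lesssim 10^{\,103 - 183\,946} = 10^{-183\,843}, \\
\frac{N}{26^{n}} < 1
  &\iff n > \frac{103}{\log_{10} 26} \approx 72.8 .
\end{align*}
```

On these assumptions, any particular text longer than about 73 letters is expected to appear less than once in the entire output, so a book of any length is far out of reach.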

But my monkeys have reproduced, by random typing, not only Hamlet, but also Macbeth, King Lear, Othello and Shakespeare’s other tragedies, and the comedies, and the histories, and the sonnets as well – the whole works in fact. Have I invented time travel? Have I ventured into alternative universes? Have I bred a race of super-intelligent and well-read monkeys, each of them with degrees in literature and longing to play the Prince of Denmark on Broadway? Alas, no. Nevertheless my monkeys have done it. They began on 21st August this year, and on 23rd September at 2.30 Pacific Standard Time they successfully completed their first Shakespearean effort, A Lover’s Complaint:

From off a hill whose concave womb re-worded
A plaintful story from a sistering vale,
My spirits to attend this double voice accorded,
And down I laid to list the sad-tuned tale…

and so on for 45 stanzas. Their progress through the other works was followed live on the Web by some 25 000 viewers as they completed each of Shakespeare’s works one by one. On 6th October at 2 a.m. the monkeys typed the last missing phrase of their last uncompleted work and finished their task. The entire exercise had taken a month and a half. We have not had to wait until the universe expired.

It is the ultimate improbability. If a million monkeys sat at a million typewriters, they would eventually reproduce all the works of Shakespeare. Now they have done it. Or have they? On 6th October this year the BBC, CNN and world media reported that the last of the monkeys finished the job. Jesse Anderson was the monkey-trainer. While Shakespeare gibbers, he explains what he – and the monkeys – have and have not done.

The million-monkey comparison goes back some way. It is often called the infinite monkey theorem, though “infinite” is a misnomer. An infinite number of monkeys would actually produce Shakespeare, and indeed every other work of every other author there has ever been or ever will be, and would do it very fast – in fact as fast as they could be typed. Our version, though, appears to have been devised in 1913 by Émile Borel, who was writing about thermodynamics¹. The second law of thermodynamics is essentially statistical. It is about random movements of molecules. Borel wrote that if a million monkeys typed for 10 hours a day, it was extremely unlikely that their output would exactly equal all the books of the richest libraries of the world; and yet, in comparison, it was even more unlikely that the laws of statistical mechanics would ever be violated. The physicist Eddington put it into English: “If an army of monkeys were strumming on typewriters they might write all the books in the British Museum. The chance of their doing so is decidedly more favourable than the chance of all the molecules of a gas returning to one half of the vessel [that contains them]”.

Creationists have tried to use the comparison to claim the impossibility of intelligent life having evolved by chance. Random combinations of DNA, they say, could not possibly produce the complexity of the human (or the monkey) genome. It is as unlikely as – you have guessed it – random letters from monkeys producing Shakespeare and is therefore, they say, impossible.

But as Richard Dawkins, among others, has pointed out, the creationists have failed to take note of one crucial word in the monkey theorem: we said at the start that the monkey hits the keys independently. That word “independently” matters.

Because my monkeys did not type out letter after letter independently and indefinitely. Instead, after every nine letters their output was compared to Shakespeare, and was then either rejected as gibberish or selected as being Shakespearian. Selecting and holding on to successful guesses makes all the difference, to evolution and creationism, to Shakespeare, and to me. This is the distinction that creationists fail to acknowledge, and that Dawkins emphasises; selecting as you go along changes the statistics, the probabilities – and the timescale.

You can read Dawkins, or more succinctly, go to the Significance website (http://www.significancemagazine.org/details/webexclusive/1353119/Monkey-business.html) and read Lewis Jones, to see how this demolishes the creationists’ argument. Unselected randomness takes almost for ever. Cumulative selection works rather quicker. And that is the method that I used.

Newsflash! Infinite number of monkeys produce Shakespeare! Chimpanzees say “We’re still trying…”

I had not an infinite number of monkeys, but only a million of them. I should say at this point that no monkeys were harmed in the course of my experiment. My monkeys were virtual. So were their typewriters. (This was, among other things, an economy measure. The computing costs of the project were about $19.20 a day. Bananas to feed a million monkeys would have been considerably more expensive.) I created a computer program using the Hadoop framework to simulate a million monkeys randomly typing.

The idea came from one of my favourite Simpsons episodes which has a scene where Mr Burns brings Homer to his mansion (http://www.youtube.com/watch?v=JcSUWP0QNeY).

One of his rooms has a thousand monkeys at a thousand typewriters. One of the monkeys writes a slightly incorrect line from Charles Dickens: “It was the best of times, it was blurst of times.” I thought I would emulate Mr Burns.

My million monkeys started out in cloud-space on Amazon EC2, but they used more RAM than the free service provided, so I moved them to my home computer. Each virtual monkey put out random gibberish nine letters at a time. This was supposed to mimic a monkey randomly mashing the keys on a keyboard. The computer program compared each nine-letter segment to every work of Shakespeare, as digitised by the Gutenberg project (http://www.gutenberg.org/ebooks/100), to see if it actually matched a small portion of what Shakespeare wrote. The character group can be matched anywhere in the work, regardless of order and of whether any or all of the preceding portions of that work have already been matched. If it does match, that portion of Shakespeare is marked to show it has been reproduced by a monkey. Thus if the monkey’s random nine-letter output was “OBEORNOTT” it would be a match, because Shakespeare also wrote that nine-letter combination, in “To be or not to be”. If the monkey wrote “QQQZXYQAB” that group would be discarded, because Shakespeare nowhere wrote those letters in that order.

This process is repeated over and over until every portion of every work of Shakespeare – all 3.7 million letters of them – has been covered by the monkeys’ gibberish. The monkeys have then re-created every work of Shakespeare.
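The matching procedure described above can be sketched in a few lines of Python. This is a single-process illustration and not the project's Hadoop code: the function names, the letters-only normalisation and the `max_groups` safety limit are all assumptions made for the example.

```python
import random
import string
from collections import defaultdict


def normalise(text):
    """Keep letters only and upper-case them (an assumed normalisation)."""
    return "".join(ch for ch in text.upper() if ch in string.ascii_uppercase)


def build_index(letters, group_size):
    """Map every group_size-letter window of the text to its start positions."""
    index = defaultdict(list)
    for start in range(len(letters) - group_size + 1):
        index[letters[start:start + group_size]].append(start)
    return index


def run_monkeys(text, group_size, rng, max_groups=10_000_000):
    """Type random groups until every letter of the text has been covered.

    Returns the number of groups typed, or None if max_groups is reached first.
    """
    letters = normalise(text)
    if len(letters) < group_size:
        raise ValueError("text is shorter than one character group")
    index = build_index(letters, group_size)
    covered = [False] * len(letters)
    remaining = len(letters)
    for typed in range(1, max_groups + 1):
        group = "".join(rng.choices(string.ascii_uppercase, k=group_size))
        # A group may match anywhere, in any order; mark every occurrence.
        for start in index.get(group, ()):
            for pos in range(start, start + group_size):
                if not covered[pos]:
                    covered[pos] = True
                    remaining -= 1
        if remaining == 0:
            return typed
    return None


if __name__ == "__main__":
    # Two-letter groups keep the demonstration fast; the project used nine.
    count = run_monkeys("To be, or not to be", 2, random.Random(2011))
    print(f"Text covered after {count} random groups")
```

Because matched portions are kept, the work needed grows with the number of distinct groups in the text, not with the number of possible texts of that length.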

The monkeys create nine characters of random gibberish at a time. Many have asked why I used nine characters and not one or some other number. There are two reasons for this decision. The first is that a 1-, 2-, or 3-character group would not be sporting. To test the correct performance of my code I actually ran all of Shakespeare through a one-character group. All of the works of Shakespeare are created in 20 seconds using this method. I doubt I would have received any kind of attention by announcing I had recreated Shakespeare in 20 seconds flat. The second reason is the scope of the work. I wanted the monkeys project to complete in 1–2 months. I ran some projections and did some back-of-envelope calculations. For my computer, the nine-character group size was just right. Given the exponential nature of the problem, going one character up or down made a very big difference.
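A back-of-envelope calculation of this kind might look like the sketch below. It is not the author's projection: it only counts the possible groups of each size and asks how long it would take to type that many groups at the throughput the article quotes for the project, 180 billion groups a day. Covering every group that occurs in Shakespeare takes longer than this, because random typing repeats itself.

```python
ALPHABET_SIZE = 26
GROUPS_PER_DAY = 180e9  # throughput quoted in the article for the author's machine


def search_space(group_size):
    """Number of distinct letter groups of the given size."""
    return ALPHABET_SIZE ** group_size


def days_to_type_space(group_size, groups_per_day=GROUPS_PER_DAY):
    """Days needed to type as many groups as there are possible combinations."""
    return search_space(group_size) / groups_per_day


if __name__ == "__main__":
    for size in range(7, 12):
        print(f"{size:2d} letters: {search_space(size):>22,d} combinations, "
              f"about {days_to_type_space(size):,.1f} days")
```

Each extra character multiplies the search space by 26, so eight-letter groups come out at roughly a day, nine at about a month and ten at over two years, which is consistent with nine being the size that fits a 1–2 month project.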

For the more technically minded, the program was written in Java using the Hadoop MapReduce framework and runs on Ubuntu. The random data source, or pseudo-random number generator, is Sean Luke’s implementation of the Mersenne Twister. The nine-character groups are passed into a Bloom filter (http://en.wikipedia.org/wiki/Bloom_filter) for a quick check to see if the group could appear in Shakespeare. A Bloom filter is a hash-based structure that checks quickly whether a piece of data might appear in a source material; it can give false positives but never false negatives. Groups that pass the filter are therefore run through a full check to see where in Shakespeare the group appeared. There is a more in-depth technical discussion on my YouTube channel (http://www.youtube.com/watch?v=JZpM_MlZFqE).
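The two-stage check can be illustrated with a small Bloom filter in Python. This is a sketch and not the project's Java code: the class, the choice of SHA-256 with double hashing, and the filter sizes are all assumptions chosen to keep the example short.

```python
import hashlib


class BloomFilter:
    """A minimal Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive several bit positions from one digest (double hashing).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def find_matches(group, bloom, index):
    """Cheap filter first; full lookup only for groups that pass it."""
    if not bloom.might_contain(group):
        return []  # definitely not in the text
    return index.get(group, [])  # the full check settles false positives


if __name__ == "__main__":
    text = "TOBEORNOTTOBE"
    index = {}
    for start in range(len(text) - 8):
        index.setdefault(text[start:start + 9], []).append(start)
    bloom = BloomFilter(num_bits=1024, num_hashes=3)
    for group in index:
        bloom.add(group)
    print(find_matches("OBEORNOTT", bloom, index))
    print(find_matches("QQQZXYQAB", bloom, index))
```

The point of the design is that almost all random groups are gibberish, so a small in-memory filter can discard them without touching the full text.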

The monkeys ran 180 billion character groups a day. An average iteration lasted 30 minutes 33 seconds and ran 5 billion character groups. There are 5 429 503 678 976 possible combinations of nine letters; the monkeys ran 7 445 912 000 000 character groups, so clearly many of the combinations occurred several times. 1 982 507 of the character groups found by the monkeys occurred also in Shakespeare, and those character groups were found 3 788 175 times, giving a repetition ratio of 1.872. The project ran for 46 days before completing its last work, which was a phrase from The Taming of the Shrew.

Throughout the project, I found that statisticians and mathematicians had the most adverse reaction to it. Having an article published in their magazine just gets me into the belly of the beast. I write this article with the hope that my hate mail does not increase. The project was thoroughly tongue in cheek, but it did require some technical skill. There is no need to e-mail and set me straight on understanding the infinite monkey theorem. I understand that the orthodox version of the theorem is that it is done all at once, by one monkey. This is a method I came up with to create a result without access to infinite resources. If you do have access to an infinite resource, please do e-mail me and we can run the project there.

Many have asked what value this project is if it is not the orthodox version. For me it is performance art with monkeys and computers. I wanted to make it engaging and to have people coming back to check the monkeys’ progress, which is why I did near real-time updates of the site. I think the project was a resounding success. It achieved its primary goal of recreating every work of Shakespeare. People saw my work. In addition to the 25 000 unique visitors to the site, millions more read about the work on mainstream news, blogs, print and radio. People were tweeting it and liking it on Facebook. I consider the social networking aspect the most gratifying part of the project; people enjoyed it and wanted to share it with others.

For those interested, the Million Monkeys project’s source code (http://code.google.com/p/million-monkeys-project) is available, so that you can re-create your own works from your favourite author. The data output is also available to anyone who e-mails me asking for it.

This is the first time a work of Shakespeare has actually been randomly reproduced and the first time every work of Shakespeare has been randomly reproduced. Furthermore, this is the largest work ever randomly reproduced. It is one small step for a monkey, one giant leap for virtual primates everywhere.

References

1. Borel, É. (1913). Mécanique statistique et irréversibilité. Journal de Physique, 5e série, 3, 189–196.

Jesse Anderson is a Senior Software Engineer at Intuit, Inc. in Reno, Nevada, USA. In addition to training monkeys to take over the world he creates distributed systems using Hadoop, mobile programming and human–computer interfaces, and watches The Simpsons. You can contact him on his personal branding site with questions or comments (http://www.jesse-anderson.com/contact/).

Each monkey typed gibberish nine letters at a time. The gibberish was compared to Shakespeare

In 2003, Paignton Zoo carried out a practical attempt to test the monkeys and Shakespeare idea by putting a keyboard connected to a PC into the cage of six Sulawesi crested macaques. After a month the monkeys had produced five pages of the letter “S” and had broken the keyboard.