Wednesday 26 October 2011

What we can learn from monkeys (part 2)

If, the saying goes, you take enough monkeys and typewriters then given enough time it is statistically probable that they will reproduce the entire works of Shakespeare. Not a word or a phrase at a time, but each play as a coherent whole. The chance of the letter "n" being pressed, out of the 46 keys on our typewriter, is 1/46, the same for "o", the same for "w", for " "[1], for "i","s"," ","t","h","e"," ","w","i","n","t","e","r" and so on until we get the complete and excellent opening line of Richard III:
"Now is the winter of our discontent made glorious summer by this sun of York"
The probability of this occurring with just one monkey is 1/46*1/46*1/46 and so on for the number of characters (76). To be more specific, the probability of creating just this first line is

1 in a number far too big to even write out in normal numbers (46^76)

Hmmm. OK, to bring it down to numbers worth writing out, let's just look at the first two words, "Now is", a total of 6 characters. The chance of one monkey randomly tapping those out is

1 in 9,474,296,896 (making the odds of winning the UK lottery jackpot, roughly 1 in 14,000,000, seem positively likely!)
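
For anyone who wants to check the arithmetic, here is a minimal Python sketch. It makes the same assumptions as above (46 keys, every key equally likely, each keystroke independent); the function and variable names are mine, purely for illustration.

```python
from fractions import Fraction

KEYS = 46  # keys on the typewriter, as assumed in this post


def chance_of_typing(text: str, keys: int = KEYS) -> Fraction:
    """Exact chance that one monkey types `text` in exactly len(text) keystrokes."""
    return Fraction(1, keys) ** len(text)


two_words = "Now is"
first_line = ("Now is the winter of our discontent made glorious summer "
              "by this sun of York")

# The denominators are the "1 in N" figures quoted in the text.
odds_two_words = chance_of_typing(two_words).denominator
odds_first_line = chance_of_typing(first_line).denominator

print(f"'{two_words}': 1 in {odds_two_words:,}")
print(f"First line ({len(first_line)} characters): "
      f"1 in a number with {len(str(odds_first_line))} digits")
```

Using `Fraction` keeps the figures exact rather than rounding them to floating point, which matters when the denominator runs to well over a hundred digits.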

But with enough monkeys and enough time it becomes more and more likely that not only will one of them tap out those opening 6 characters, but also the opening 76 characters, and even all of the large number of characters that is Richard III, all the rest of Shakespeare's surviving plays and even the lost ones (although I'm not sure how we would know the lost ones had been correctly typed out...). With an infinite number of monkeys and typewriters, it would not be statistically significant that the monkeys produced the complete works. If you started with nothing and 6 weeks later someone delivered a typewritten manuscript, the fact that infinite monkeys were involved means it would be not unlikely, i.e. not *mathematically* improbable, that the manuscript had been produced by pure random chance by a bunch of monkeys with typewriters.
Arthur looked up. "Ford!" he said, "there's an infinite number of monkeys outside who want to talk to us about this script for Hamlet they've worked out."[2]
So here our ever helpful monkeys are helping teach us something about statistics: that they are a dangerous source of truths. While not mathematically improbable, this truth is heavily dependent on a few highly unlikely things, like having an infinite living space in which to house infinite numbers of monkeys with their typewriters.
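
To see how much work "enough monkeys" is doing, here is a rough Python sketch of the chance that at least one monkey out of a troop manages "Now is" on a single six-keystroke attempt. It assumes every monkey types independently and gets exactly one go; the function name and the troop sizes are my own illustrative choices.

```python
import math

P_NOW_IS = 1 / 46 ** 6  # one monkey, one attempt at "Now is"


def chance_any_monkey(p: float, monkeys: int) -> float:
    """Chance that at least one of `monkeys` independent attempts succeeds.

    This is 1 - (1 - p)**monkeys, written with log1p/expm1 so that a very
    small p is not lost to floating-point rounding.
    """
    return -math.expm1(monkeys * math.log1p(-p))


for troop in (1, 1_000_000, 46 ** 6, 10 * 46 ** 6):
    print(f"{troop:>15,} monkeys: {chance_any_monkey(P_NOW_IS, troop):.6f}")
```

Even with as many monkeys as there are possible six-key sequences, the chance that one of them gets it only reaches about 63%, and that is for two short words rather than a whole play.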

This becomes even more of an issue when numbers are turned into statistics to be used by politicians, news outlets and others with an agenda. All too often they mistake correlation for causation, using statistics to demonstrate why some new policy or other is needed, or why a current one should be changed, without understanding that (1) statistical significance is mostly about having, or assuming, the right number of monkeys and (2) the fact that two measurements correlate does not mean one caused the other. News outlets in particular also have a tendency to reproduce statistics as the agenda-pusher would have them reproduced: "the murder rate in the country has gone up 10%" (it was 10 in year 1 and 11 in year 2). The same numbers could just as easily, and probably less misleadingly, have been reported with a more qualitative statement such as "the murder rate in the country was stable".
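
A quick Python sketch of why "up 10%" and "stable" can describe the same two numbers. The percentage is plain arithmetic; the "typical scatter" figure is my own added assumption, namely that yearly counts of rare events vary roughly like a Poisson process, where the random year-to-year wobble is about the square root of the count.

```python
import math


def percent_change(before: int, after: int) -> float:
    """Relative change from `before` to `after`, as a percentage."""
    return (after - before) / before * 100


year1, year2 = 10, 11  # the counts from the example above

change = percent_change(year1, year2)
difference = year2 - year1
# ASSUMPTION (not from the example): Poisson-like counts, so the typical
# random scatter is about sqrt(count).
typical_scatter = math.sqrt(year1)

print(f"Headline: up {change:.0f}%")
print(f"Actual difference: {difference}; typical scatter: about {typical_scatter:.1f}")
```

Under that assumption a difference of 1 sits comfortably inside the usual random variation, which is what the word "stable" is getting at.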

It is this misuse and misunderstanding that help give rise to the saying "Lies, damned lies and statistics"[3]; but everyone is at it. Internal "news outlets" (news inlets?) are equally prone to misleading messages of the type "90% of users rate the IT support service as 4/5 or higher", which is more precisely reported as "of the people who got around to filling out the satisfaction survey when their helpdesk ticket was closed, 90% rated the service as 4/5 or higher". That statement would be further informed by knowing that the survey defaults to 5 and you have to change it to anything lower; then take into account such truisms as people being more likely to complain than praise. Eventually a qualitative statement works out to be more useful: "the IT support service is making very few people angry", which if new software is being rolled out is good news indeed!

Qualitative statements are seen to carry less weight than ones laden with numbers, which in organisations is probably the fault of the CFO; this seems ironic, given the quantity of assumptions and informed guesswork that is the basis of corporate accounting...

In the end remember this: statistics just provide information in a numerical form. What it all actually means is a matter of interpretation. In other words, don't be misled into believing the numbers are not just another qualitative measure...

[1] Although surely the space bar is so much bigger it would be more likely to be pressed? Damn these complications and assumptions...
[2] Douglas Adams, The Hitchhiker's Guide to the Galaxy, with a little help from the Improbability Drive
[3] Said by someone, some time in some form: http://www.york.ac.uk/depts/maths/histstat/lies.htm