

One recent example of the need for caution in interpreting data is the excitement earlier this year surrounding the apparent groundbreaking detection of gravitational waves – an announcement that appears to have been made prematurely, before all the variables that were affecting the data were accounted for. Unfortunately, analysing statistics, probabilities and risks is not a skill set wired into our human intuition, so it is all too easy to be led astray. Entire books have been written on the subtle ways in which statistics can be misinterpreted (or used to mislead). To help keep your guard up, here are some common slippery statistical problems that you should be aware of:

1) The Healthy Worker Effect, where sometimes two groups cannot be directly compared on a level playing field.

Consider a hypothetical study comparing the health of a group of office workers with the health of a group of astronauts. If the study shows no significant difference between the two – no correlation between healthiness and working environment – are we to conclude that living and working in space carries no long-term health risks for astronauts? No! The groups are not on the same footing: the astronaut corps screen applicants to find healthy candidates, who then maintain a comprehensive fitness regime in order to proactively combat the effects of living in “microgravity”. We would therefore expect them to be significantly healthier than office workers, on average, and should rightly be concerned if they were not.

2) Categorisation and the Stage Migration Effect – shuffling people between groups can have dramatic effects on statistical outcomes.

This is also known as the Will Rogers effect, after the US comedian who reportedly quipped: “When the Okies left Oklahoma and moved to California, they raised the average intelligence level in both states.” To illustrate, imagine dividing a large group of friends into a “short” group and a “tall” group (perhaps in order to arrange them for a photo). Having done so, it’s surprisingly easy to raise the average height of both groups at once. Simply ask the shortest person in the “tall” group to switch over to the “short” group. The “tall” group lose their shortest member, thus bumping up their average height – but the “short” group gain their tallest member yet, and thus also gain in average height.

This has major implications in medical studies, where patients are often sorted into “healthy” or “unhealthy” groups in the course of testing a new treatment. If diagnostic methods improve, some very-slightly-unhealthy patients may be recategorised – leading to the health outcomes of both groups improving, regardless of how effective (or not) the treatment is.

[Graph: Picking and choosing among the data can lead to the wrong conclusions. The skeptics see periods of cooling (blue) when the data really shows long-term warming (green).]

3) Data mining – when an abundance of data is present, bits and pieces can be cherry-picked to support any desired conclusion.

This is bad statistical practice, but if done deliberately it can be hard to spot without knowledge of the original, complete data set. Consider the above graph showing two interpretations of global warming data, for instance. Or fluoride – in small amounts it is one of the most effective preventative medicines in history, but the positive effect disappears entirely if one only ever considers toxic quantities of fluoride.

For similar reasons, it is important that the procedures for a given statistical experiment are fixed in place before the experiment begins and then remain unchanged until the experiment ends.

4) Clustering – which is to be expected even in completely random data.

Consider a medical study examining how a particular disease, such as cancer or multiple sclerosis, is geographically distributed. If the disease strikes at random (and the environment has no effect) we would expect to see numerous clusters of patients as a matter of course. If patients were spread out perfectly evenly, the distribution would be most un-random indeed! So the presence of a single cluster, or a number of small clusters of cases, is entirely normal.
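The point about clustering in random data is easy to check with a small simulation. The sketch below is only an illustration under invented assumptions: 100 cases are dropped into 100 equally sized districts, each case landing in a district chosen uniformly at random, and the function name `scatter_cases` is made up for this example. With as many cases as districts, roughly a third of the districts would typically be expected to end up with no cases at all, while a few collect several.

```python
import random
from collections import Counter

def scatter_cases(n_cases, n_districts, seed=0):
    """Assign each case to a district uniformly at random.

    Returns a list with the number of cases in each district.
    The numbers and the seed are illustrative assumptions, not real data.
    """
    rng = random.Random(seed)
    hits = Counter(rng.randrange(n_districts) for _ in range(n_cases))
    return [hits.get(d, 0) for d in range(n_districts)]

counts = scatter_cases(n_cases=100, n_districts=100)
empty = sum(1 for c in counts if c == 0)

print("districts with no cases:", empty)
print("largest number of cases in one district:", max(counts))
```

Changing the seed gives a different map each time, but some districts with several cases, sitting alongside many empty ones, should appear in practically every run even though no district is any riskier than another.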

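The Will Rogers effect described earlier can also be verified with a few lines of arithmetic. This is a minimal sketch: the heights are hypothetical values in centimetres, invented for illustration, and `move_shortest` is a helper name made up for this example.

```python
from statistics import mean

# Hypothetical heights in centimetres, invented purely for illustration.
short_group = [150, 155, 160, 165]
tall_group = [170, 180, 185, 190]

def move_shortest(tall, short):
    """Move the shortest member of `tall` into `short`.

    Returns new (tall, short) lists and leaves the inputs unchanged.
    """
    mover = min(tall)
    new_tall = list(tall)
    new_tall.remove(mover)
    return new_tall, short + [mover]

new_tall, new_short = move_shortest(tall_group, short_group)

print("short group average:", mean(short_group), "->", mean(new_short))
print("tall group average:", mean(tall_group), "->", mean(new_tall))
```

Both averages rise even though nobody has grown: the person who moved is below the average of the group they left and above the average of the group they joined. The same arithmetic applies when patients are recategorised between “healthy” and “unhealthy” groups.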