Facts vs. Factions: the Use and Abuse of Subjectivity in Scientific Research

By Robert Matthews

There seemed no doubt about it: if you were going to have a heart attack, there was never a better time than the early 1990s. Your chances of survival appeared to be better than ever. Leading medical journals were reporting results from new ways of treating heart attack victims whose impact on death-rates wasn't just good--it was amazing. 

In 1992, trials in Scotland of a clot-busting drug called anistreplase suggested that it could double the chances of survival. A year later, another "miracle cure" emerged: injections of magnesium, which studies suggested could also double survival rates. Leading cardiologists hailed the injections as an "effective, safe, simple and inexpensive" treatment that could save the lives of thousands. 

But then something odd began to happen. In 1995, the Lancet published the results of a huge international study of heart attack survival rates among 58,000 patients - and the amazing life-saving abilities of magnesium injections had simply vanished. Anistreplase fared little better: the current view is that its real effectiveness is barely half that suggested by the original trial. 

In the long war against Britain's single biggest killer, a few disappointments are obviously inevitable. And over the last decade or so, scientists have identified other heart attack treatments which in trials reduced mortality by up to 30 percent. 

But again, something odd seems to be happening. Once these drugs get out of clinical trials and onto the wards, they too seem to lose their amazing abilities. 

Last year, Dr Nigel Brown and colleagues at Queen's Medical Centre in Nottingham published a comparison of death rates among heart attack patients for 1989-1992 and those back in the clinical "Dark Ages" of 1982-1984, before such miracles as thrombolytic therapy had shown success in trials. Their aim was to answer a simple question: just what impact have these "clinically proven" treatments had on death rates out on the wards?

Judging by the trial results, the wonder treatments should have led to death rates on the wards of just 10 percent or so. What Dr Brown and his colleagues actually found was, to put it mildly, disconcerting. Out on the wards, the wonder drugs seem to be having no effect at all. In 1982, the death rate among patients admitted with heart attacks was about 20 percent. Ten years on, it was the same: 20 percent - double the death rate predicted by the clinical trials. 

In the search for explanations, Dr Brown and his colleagues pointed to the differences between patients in clinical trials - who tend to be hand-picked and fussed over by leading experts - and the ordinary punter who ends up in hospital wards. They also suggested that delays in patients arriving on the wards might be preventing the wonder drugs from showing their true value. 

All of which would seem perfectly reasonable - except that heart attack therapies are not the only "breakthroughs" that are proving to be damp squibs out in the real world. 

Over the years, cancer experts have seen a host of promising drugs dismally fail once outside clinical trials. In 1986, an analysis of cancer death rates in the New England Journal of Medicine concluded that "Some 35 years of intense effort focused largely on improving treatment must be judged a qualified failure". Last year, the same journal carried an update: "With 12 more years of data and experience", the authors said, "We see little reason to change that conclusion". 

Scientists investigating supposed links between ill-health and various "risk factors" have seen the same thing: impressive evidence of a "significant" risk - which then vanishes again when others try to confirm its existence. Leukaemias and overhead pylons, connective tissue disease and silicone breast implants, salt and high blood pressure: all have an impressive heap of studies pointing to a significant risk - and an equally impressive heap saying it isn't there. 

It is the same story beyond the medical sciences, in fields from psychology to genetics: amazing results discovered by reputable research groups which then vanish again when others try to replicate them. 

Much effort has been spent trying to explain these mysterious cases of The Vanishing Breakthrough. Over-reliance on data from tiny samples, the reluctance of journals to print negative findings from early studies, outright cheating: all have been put forward as possible suspects. 

Yet the most likely culprit has long been known to statisticians. A clue to its identity comes from the one feature all of these scientific disciplines have in common: they all rely on so-called "significance tests" to gauge the importance of their findings. 

First developed in the 1920s, these tests are routinely used throughout the scientific community. Thousands of scientific papers and millions of pounds of research funding have been based on their conclusions. They are ubiquitous and easy to use. And they are fundamentally and dangerously flawed. 

Used to analyse clinical trials, these textbook techniques can easily double the apparent effectiveness of a new drug and turn a borderline result into a highly "significant" breakthrough. They can throw up convincing yet utterly spurious evidence for "links" between diseases and any number of supposed causes. They can even lend impressive support to claims for the existence of the paranormal. 

The very suggestion that these basic flaws in such widely-used techniques could have been missed for so long is astonishing. Altogether more astonishing, however, is the fact that the scientific community has been repeatedly warned about these flaws - and has ignored the warnings.

As a result, thousands of research papers are being published every year whose conclusions are based on techniques known to be unreliable. The time and effort - and public money - wasted in trying to confirm the consequent spurious findings is one of the great scientific scandals of our time. 

The roots of this scandal are deep, having their origins in the work of an English mathematician and cleric named Thomas Bayes, published over 200 years ago. In his "Essay Towards Solving a Problem in the Doctrine of Chances", Bayes gave a mathematical recipe of astonishing power. Put simply, it shows how we should change our belief in a theory in the light of new evidence. 

One does not need to be a statistician to see the fundamental importance of "Bayes's Theorem" for scientific research. From studies of the cosmos to trials of cancer drugs, all research is ultimately about finding out how we should change our belief in a theory as new data emerge.

For over 150 years, Bayes's Theorem formed the foundation of statistical science, allowing researchers to assess the meaning of new results. But during the early part of the 20th century, a number of influential mathematicians and philosophers began to raise objections to Bayes's Theorem. The most damning was also the simplest: different people could use Bayes's Theorem and get different results.

Faced with the same experimental evidence for, say, ESP, true believers could use Bayes's Theorem to claim that the new results implied that telepathy is almost certainly real. Skeptics, in contrast, could use Bayes's Theorem to insist they were still not convinced. 

Both views are possible because Bayes's Theorem shows only how to alter one's prior level of belief - and different people can start out with different opinions. 
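To see how the same evidence can leave a believer and a skeptic far apart, here is a minimal sketch of Bayes's Theorem for a single yes-or-no theory. The function name and every number in it are hypothetical, chosen purely for illustration: the believer is assumed to start 90 percent sure that ESP is real, the skeptic 0.1 percent sure, and the experimental result is assumed to be ten times more likely if ESP is real than if it is not.

```python
def posterior(prior, p_data_if_true, p_data_if_false):
    """Bayes's Theorem: updated belief in a theory after seeing the data."""
    numerator = prior * p_data_if_true
    evidence = numerator + (1.0 - prior) * p_data_if_false
    return numerator / evidence


# Hypothetical likelihoods: the result is ten times more probable
# if ESP is real than if only chance is at work.
P_DATA_IF_ESP = 0.5
P_DATA_IF_CHANCE = 0.05

# Hypothetical starting beliefs (priors) for the two camps.
believer = posterior(0.9, P_DATA_IF_ESP, P_DATA_IF_CHANCE)
skeptic = posterior(0.001, P_DATA_IF_ESP, P_DATA_IF_CHANCE)

print(f"Believer's belief in ESP after the experiment: {believer:.3f}")
print(f"Skeptic's belief in ESP after the experiment:  {skeptic:.4f}")
```

On these assumed figures the believer ends up about 99 percent convinced and the skeptic about 1 percent convinced. Both have used the theorem correctly and both have moved towards ESP; they differ only in where they started.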

To non-scientists, this may not seem like an egregious failing at all: what one person sees as convincing evidence may obviously fail to impress others. No matter: the fact that Bayes's Theorem could lead different people to different conclusions led to its being inextricably linked to the most rebarbative concept known to scientists: subjectivity. 

It is hard to convey the emotions roused within the scientific community by the S-word. Subjectivity is seen as the barbarian at the gates of science, the enemy of objective truth, the destroyer of insight. It is seen as the mind-virus that has turned the humanities into an intellectual free-for-all, where the idea of "progress" is dismissed as bourgeois, and the belief in "facts" naïve. Once allowed into the citadel of science, runs the argument, subjectivity would turn all research into glorified literary criticism. 

By the 1920s, Bayes's Theorem had all but been declared heretical - which created a problem: what were scientists going to replace it with? The answer came from one of Bayes's most brilliant critics: the Cambridge mathematician and geneticist, Ronald Aylmer Fisher - father of modern statistics. 

Few scientists had greater need of a replacement for Bayes than Fisher, who frequently worked with complex data from plant breeding trials. Drawing on his great mathematical ability, he set about finding a new and completely objective way of drawing conclusions from experiments. By 1925, he believed he had succeeded, and published his techniques in a book, "Statistical Methods for Research Workers". It was to become one of the most influential texts in the history of science, and laid the foundations for virtually all the statistics now used by scientists.

On the face of it, Fisher had achieved what Bayes claimed was impossible: he had found a way of judging the "significance" of experimental data entirely objectively. That is, he had found a way that anyone could use to show that a result was too impressive to be dismissed as a fluke. 

All scientists had to do, said Fisher, was to convert their raw data into something called a P-value, a number giving the probability that chance alone would produce results at least as impressive as those seen. If this P-value was below 1 in 20, or 0.05, said Fisher, it was safe to conclude that a finding really was "significant".
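As a rough illustration of what a P-value is, the sketch below works one out for a made-up guessing experiment: 20 attempts, each with an even chance of success by luck alone. The experiment, the function name and the one-sided form of the test are all assumptions made for the example, not anything taken from Fisher's book.

```python
from math import comb


def p_value(hits, trials, chance):
    """One-sided P-value: probability of at least `hits` successes
    in `trials` attempts if only chance (success rate `chance`) is at work."""
    return sum(
        comb(trials, k) * chance**k * (1.0 - chance) ** (trials - k)
        for k in range(hits, trials + 1)
    )


# Hypothetical experiment: 20 guesses, each right half the time by luck.
for hits in (14, 15):
    p = p_value(hits, 20, 0.5)
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"{hits} hits out of 20: P = {p:.4f} ({verdict})")
```

On this example 14 hits gives a P-value of about 0.058 and 15 hits about 0.021, so a single extra lucky guess carries the result across Fisher's line.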

Combining simplicity with apparent objectivity, Fisher's P-value method was an immediate hit with the scientific community. Its popularity endures to this day. Open any leading scientific journal and you will see the phrase "P < 0.05" - the hallmark of a significant finding - in papers on every conceivable area of research, from astronomy to zoology. Every year, new statistics textbooks appear to explain Fisher's simple little recipe to a new generation of researchers. 

But just as scientists were adopting P-values, a few awkward questions started to be asked by other statisticians. The most telling was raised by the distinguished Cambridge mathematician Harold Jeffreys. Writing in his own treatise on statistics, Theory of Probability, published in 1939, Jeffreys asked an obvious question: just why should the dividing line for significance be set at Fisher's value of 0.05? 

This seemingly innocuous question has profound implications, for Fisher's figure of 0.05 is still the sine qua non for deciding if a scientific result is "significant". All scientists know that if their experiment gives a P-value meeting Fisher's standard they are on their way to having a publishable paper. Fisher's standard is even more important for pharmaceutical companies, as national regulatory organisations still use Fisher's 0.05 figure to decide whether to approve a new drug for general release. Getting drug trial results with P-values that beat Fisher's standard can thus make the difference between millions in profits or bankruptcy. 

So just what were the brilliant insights that led Fisher to choose that talismanic figure of 0.05, on which so much scientific research has since stood or fallen? Incredibly, as Fisher himself admitted, there weren't any. He simply decided on 0.05 because it was mathematically convenient.

The implications of this are truly disturbing. It means that key scientific questions such as whether a new heart drug is seen as effective or whether diet really is linked to cancer are being decided by an entirely arbitrary standard chosen over 70 years ago for mathematical "convenience". 

This would not matter if Fisher had been lucky, and chosen a figure that makes the risk of being fooled by a fluke result very low. Yet statisticians now know that his choice was a particularly bad one - and that many supposedly "significant" findings are in fact entirely spurious. 

The first hints of this deeply worrying feature of Fisher's methods emerged as long ago as the early 1960s, following a resurgence of interest in Bayes's Theorem. Many of the supposedly "insuperable" objections to its use were shown to be baseless, and the theorem has since emerged as one of the axioms of the entire theory of probability. As such, its implications for statistics cannot be wished away - no matter how noisome scientists might find them.

And the most important of those implications is that - as Bayes himself had insisted 200 years ago - it is indeed impossible to judge the "significance" of data in isolation. Crucially, the plausibility of the data has to be taken into account. 

Using Bayes's Theorem, a number of leading statisticians began to probe the reliability of P-values as a measure of significance. What they discovered could hardly be more serious. 

On the face of it, Fisher's standard of 0.05 suggests that the chance of a mere fluke being the real explanation for a given result is just 5 in 100 - plenty of protection against being fooled. But in 1963, a team of statisticians at the University of Michigan showed that the actual chances of being fooled could easily be 10 times higher. Because it fails to take into account plausibility, Fisher's test can see "significance" in results which are actually over 50 percent likely to be utter nonsense.
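The gap between a P-value of 0.05 and the real chance of a fluke can be sketched with one standard textbook bound. It is not necessarily the calculation the Michigan team carried out, and it rests on assumptions: the test statistic is normally distributed, the P-value is two-sided, and the rival to "fluke" is taken to be the single effect size that the data support best. That last assumption makes the bound as kind to the claimed effect as it can possibly be. The function names and the prior beliefs fed in are hypothetical.

```python
import math
from statistics import NormalDist


def min_likelihood_ratio(p_value):
    """Smallest possible ratio P(data | fluke) / P(data | real effect)
    for a two-sided P-value from a normally distributed test statistic."""
    z = NormalDist().inv_cdf(1.0 - p_value / 2.0)
    return math.exp(-z * z / 2.0)


def min_prob_fluke(p_value, prior_fluke):
    """Lower bound on the probability that the result is a fluke,
    given a prior belief `prior_fluke` that there is no real effect."""
    prior_odds = prior_fluke / (1.0 - prior_fluke)
    posterior_odds = prior_odds * min_likelihood_ratio(p_value)
    return posterior_odds / (1.0 + posterior_odds)


# Hypothetical priors: an open-minded 50:50, and an implausible claim
# given only a 1-in-10 chance of being real before the experiment.
for prior_fluke in (0.5, 0.9):
    bound = min_prob_fluke(0.05, prior_fluke)
    print(f"Prior chance of fluke {prior_fluke:.0%}: "
          f"after P = 0.05 it is still at least {bound:.0%}")
```

Under these assumptions a result that just meets the 0.05 standard leaves at least a 13 percent chance of a fluke for someone who started at 50:50, and at least 57 percent when the claim had only a 1-in-10 chance of being real to begin with.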

The team - which included Professor Leonard Savage, one of the most distinguished experts on probability of modern times - warned researchers that Fisher's little recipe was "startlingly prone" to seeing significance in fluke results. 

Despite being published in the prestigious Psychological Review, it was a warning that went unheeded. Over the next 30 years, other statisticians also tried to sound the alarm bell, again without success. During the 1980s, Professor James Berger of Purdue University - a world authority on Bayes's Theorem - published an entire series of papers again warning of the "astonishing" tendency of Fisher's P-values to exaggerate significance. Findings that met the 0.05 standard, said Berger, "Can actually arise when the data provide very little or no evidence in favour of an effect". Again, the warnings were ignored.

In 1986, one scientist decided to take direct action against the failings of Fisher's methods. Professor Kenneth Rothman of the University of Massachusetts, editor of the well-respected American Journal of Public Health, told all researchers wanting to publish in the journal that he would no longer accept results based on P-values. 

It was a simple move that had a dramatic effect: the teaching in America's leading public health schools was transformed, with statistics courses revised to train students in alternatives to P-values. But two years later, when Rothman stepped down from the editorship, his ban on P-values was dropped - and researchers went back to their old ways. 

It has been a similar story in Britain. In 1995, the British Psychological Society and its counterpart in America quietly set up a working party to consider introducing a ban on P-values in their journals. The following year, it was disbanded - having made no decision. "It just sort of petered out", said one insider. "The view was that it would cause too much upheaval for the journals."

Leading British medical journals have also examined the idea of banning P-values, but they too have pulled back. Instead, they merely suggest that researchers use other means of measuring significance. Yet these alternative methods are known to suffer similar flaws to P-values, exaggerating both the size of implausible effects and their significance.

More than 30 years after the first warnings were sounded, it has become clear that the scientific community has no intention of dealing with the flaws in significance tests. Yet the evidence of those flaws is everywhere to be seen: flaky claims of health risks from a host of implausible causes, "wonder drugs" that lose their amazing abilities outside clinical trials, bizarre "links" between genetics and personality. 

A striking feature of the excuses given for the lack of action is that they centre on issues like "upheaval for our journals" and the "radical changes" needed in the training of scientists. Curiously for a profession supposedly dedicated to discovering truths, issues such as "reliability of research conclusions" are never mentioned. 

It is hard to avoid the conclusion that the real explanation for all the foot-dragging is not scientific at all. It is simply that if scientists abandoned significance tests like P-values, many of their claims would be seen for what they really are: meaningless flukes on which tax-payers' money should never have been spent.

The plain fact is that in 1925, Ronald Fisher gave scientists a mathematical machine for turning baloney into breakthroughs, and flukes into funding. It is time to pull the plug. 


Robert Matthews' full account of the issues raised in this article is "Facts versus Factions: the use and abuse of subjectivity in scientific research." At one time the article was also available from the European Science and Environment Forum, 4 Church Lane, Barton, Cambridge CB3 7BE, price £3.50 (UK) £3.75 (Europe), though I don't know that it's still available from this source.
