Facts vs. Factions: the Use and Abuse of Subjectivity in Scientific Research
There seemed no doubt about it: if you were going to have a heart attack,
there was never a better time than the early 1990s. Your chances of
survival appeared to be better than ever. Leading medical journals
were reporting results from new ways of treating heart attack victims
whose impact on death-rates wasn't just good--it was amazing.
Trials in Scotland of a clot-busting drug called anistreplase suggested
that it could double the chances of survival. A year later, another
"miracle cure" emerged: injections of magnesium, which studies suggested
could also double survival rates. Leading cardiologists hailed the
injections as an "effective, safe, simple and inexpensive" treatment
that could save the lives of thousands.
Then something odd began to happen. In 1995, the Lancet published the results
of a huge international study of heart attack survival rates among
58,000 patients - and the amazing life-saving abilities of magnesium
injections had simply vanished. Anistreplase fared little better:
the current view is that its real effectiveness is barely half that
suggested by the original trial.
In the long war against Britain's single biggest killer, a few disappointments
are obviously inevitable. And over the last decade or so, scientists
have identified other heart attack treatments which in trials reduced
mortality by up to 30 percent.
Yet again, something odd seems to be happening. Once these drugs get out of clinical
trials and onto the wards, they too seem to lose their amazing abilities.
Recently, Dr Nigel Brown and colleagues at Queen's Medical Centre in Nottingham
published a comparison of death rates among heart attack patients
for 1989-1992 and those back in the clinical "Dark Ages" of 1982-4,
before such miracles as thrombolytic therapy had shown success in
trials. Their aim was to answer a simple question: just what impact
have these "clinically proven" treatments had on death rates out on
the wards? Judging by the trial results, the wonder treatments should have led to death
rates on the wards of just 10 percent or so. What Dr Brown and his
colleagues actually found was, to put it mildly, disconcerting. Out
on the wards, the wonder drugs seem to be having no effect at all.
In 1982, the death rate among patients admitted with heart attacks
was about 20 percent. Ten years on, it was the same: 20 percent -
double the death rate predicted by the clinical trials.
In their search for explanations, Dr Brown and his colleagues pointed to the
differences between patients in clinical trials - who tend to be hand-picked
and fussed over by leading experts - and the ordinary punter who ends
up in hospital wards. They also suggested that delays in patients
arriving on the wards might be preventing the wonder drugs from showing
their true value.
All of which would seem perfectly reasonable - except that heart attack therapies
are not the only "breakthroughs" that are proving to be damp squibs
out in the real world.
Over the years, cancer experts have seen a host of promising drugs dismally
fail once outside clinical trials. In 1986, an analysis of cancer
death rates in the New England Journal of Medicine concluded
that "Some 35 years of intense effort focused largely on improving
treatment must be judged a qualified failure". Last year, the same
journal carried an update: "With 12 more years of data and experience",
the authors said, "we see little reason to change that conclusion".
Researchers investigating supposed links between ill-health and various "risk
factors" have seen the same thing: impressive evidence of a "significant"
risk - which then vanishes again when others try to confirm its existence.
Leukaemias and overhead pylons, connective tissue disease and silicone
breast implants, salt and high blood pressure: all have an impressive
heap of studies pointing to a significant risk - and an equally impressive
heap saying it isn't there.
It is the same story beyond the medical sciences, in fields from psychology
to genetics: amazing results discovered by reputable research groups
which then vanish again when others try to replicate them.
Much effort has been spent trying to explain these mysterious cases of
The Vanishing Breakthrough. Over-reliance on data from tiny samples,
the reluctance of journals to print negative findings from early studies,
outright cheating: all have been put forward as possible suspects.
Yet the most likely culprit has long been known to statisticians. A clue to
its identity comes from the one feature all of these scientific disciplines
have in common: they all rely on so-called "significance tests" to
gauge the importance of their findings.
First developed in the 1920s, these tests are routinely used throughout
the scientific community. Thousands of scientific papers and millions
of pounds of research funding have been based on their conclusions.
They are ubiquitous and easy to use. And they are fundamentally flawed.
Used to analyse clinical trials, these textbook techniques can easily double
the apparent effectiveness of a new drug and turn a borderline result
into a highly "significant" breakthrough. They can throw up convincing
yet utterly spurious evidence for "links" between diseases and any
number of supposed causes. They can even lend impressive support to
claims for the existence of the paranormal.
The suggestion that these basic flaws in such widely-used techniques could
have been missed for so long is astonishing. Altogether more astonishing,
however, is the fact that the scientific community has been repeatedly
warned about these flaws - and has ignored the warnings.
As a result, thousands of research papers are being published every year
whose conclusions are based on techniques known to be unreliable.
The time and effort - and public money - wasted in trying to confirm
the consequent spurious findings is one of the great scientific scandals
of our time.
The roots of this scandal are deep, having their origins in the work of an English
mathematician and cleric named Thomas Bayes, published over 200 years
ago. In his "Essay Towards Solving a Problem in the Doctrine of Chances",
Bayes gave a mathematical recipe of astonishing power. Put simply,
it shows how we should change our belief in a theory in the light
of new evidence.
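In modern notation, that recipe is a single formula, the standard statement
of the theorem, with H standing for a hypothesis and D for the new data:

    P(H \mid D) = \frac{P(D \mid H)\,P(H)}{P(D)}

Here P(H) is the prior - one's belief before seeing the data - P(D | H)
measures how well the hypothesis predicts the data, and P(H | D) is the
updated belief once the data are in.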
You do not need to be a statistician to see the fundamental importance
of "Bayes's Theorem" for scientific research. From studies of the
cosmos to trials of cancer drugs, all research is ultimately about
finding out how we should change our belief in a theory as new data
come in.
For 150 years, Bayes's Theorem formed the foundation of statistical science,
allowing researchers to assess the meaning of new results. But during
the early part of this century, a number of influential mathematicians
and philosophers began to raise objections to Bayes's Theorem. The
most damning was also the simplest: different people could use Bayes's
Theorem and get different results.
Faced with the same experimental evidence for, say, ESP, true believers
could use Bayes's Theorem to claim that the new results implied that
telepathy is almost certainly real. Skeptics, in contrast, could use
Bayes's Theorem to insist they were still not convinced.
Such views are possible because Bayes's Theorem shows only how to alter
one's prior level of belief - and different people can start out with
very different priors.
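To see the point with numbers, here is a minimal sketch in Python; the
likelihood ratio of 20 and the two priors are invented purely for
illustration:

    # Bayes's Theorem for a yes/no hypothesis, in odds form: the
    # posterior odds equal the prior odds times the likelihood ratio.

    def updated_belief(prior, likelihood_ratio):
        """Return the posterior probability given a prior probability
        and a likelihood ratio for the new evidence."""
        prior_odds = prior / (1.0 - prior)
        posterior_odds = prior_odds * likelihood_ratio
        return posterior_odds / (1.0 + posterior_odds)

    LR = 20.0  # invented: data 20 times likelier if telepathy is real

    believer = updated_belief(0.5, LR)   # starts undecided
    skeptic = updated_belief(1e-6, LR)   # starts deeply doubtful

    print(f"believer now at {believer:.3f}")   # about 0.952
    print(f"skeptic now at {skeptic:.6f}")     # about 0.000020

The same evidence, processed through the same theorem, leaves one party
near-certain and the other unmoved.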
At first sight, this may not seem like an egregious failing at all: what one person
sees as convincing evidence may obviously fail to impress others.
No matter: the fact that Bayes's Theorem could lead different people
to different conclusions led to its being inextricably linked to the
most rebarbative concept known to scientists: subjectivity.
It is hard to convey the emotions roused within the scientific community
by the S-word. Subjectivity is seen as the barbarian at the gates
of science, the enemy of objective truth, the destroyer of insight.
It is seen as the mind-virus that has turned the humanities into an
intellectual free-for-all, where the idea of "progress" is dismissed
as bourgeois, and the belief in "facts" naïve. Once allowed into
the citadel of science, runs the argument, subjectivity would turn
all research into glorified literary criticism.
By the 1920s, Bayes's Theorem had all but been declared heretical - which
created a problem: what were scientists going to replace it with?
The answer came from one of Bayes's most brilliant critics: the Cambridge
mathematician and geneticist, Ronald Aylmer Fisher - father of modern statistics.
Few had greater need of a replacement for Bayes than Fisher, who frequently
worked with complex data from plant breeding trials. Drawing on his
great mathematical ability, he set about finding a new and completely
objective way of drawing conclusions from experiments. By 1925, he
believed he had succeeded, and published his techniques in a book,
"Statistical Methods for Research Workers". It was to become one of
the most influential texts in the history of science, and laid the
foundations for virtually all the statistics now used by scientists.
On the face of it, Fisher had achieved what Bayes claimed was impossible:
he had found a way of judging the "significance" of experimental data
entirely objectively. That is, he had found a way that anyone could
use to show that a result was too impressive to be dismissed as a mere fluke.
All researchers had to do, said Fisher, was to convert their raw data into something
called a P-value: the probability of getting results at least as impressive
as those actually seen, by chance alone. If this P-value was below 1 in 20,
or 0.05, said Fisher, it was safe to conclude that a finding really was
"significant".
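For a flavour of the recipe, here is a minimal sketch in Python; the
coin-flipping experiment is invented for illustration. It computes an exact
two-sided P-value: the probability, under chance alone, of a result at
least as lopsided as the one observed.

    from math import comb

    def binomial_p_value(heads, flips):
        """Two-sided exact P-value for a fair coin: the probability of
        a count at least as far from flips/2 as 'heads' is."""
        excess = abs(heads - flips / 2.0)
        p = 0.0
        for k in range(flips + 1):
            if abs(k - flips / 2.0) >= excess:
                p += comb(flips, k) * 0.5 ** flips
        return p

    # 65 heads in 100 tosses of a supposedly fair coin:
    print(f"P = {binomial_p_value(65, 100):.4f}")  # about 0.0035

By Fisher's standard, P = 0.0035 comfortably beats 0.05, and the coin
would be declared significantly biased.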
Combining simplicity with apparent objectivity, Fisher's P-value method was
an immediate hit with the scientific community. Its popularity endures
to this day. Open any leading scientific journal and you will see
the phrase "P < 0.05" - the hallmark of a significant finding -
in papers on every conceivable area of research, from astronomy to
zoology. Every year, new statistics textbooks appear to explain Fisher's
simple little recipe to a new generation of researchers.
Even as scientists were adopting P-values, a few awkward questions started
to be asked by other statisticians. The most telling was raised by
the distinguished Cambridge mathematician Harold Jeffreys. Writing
in his own treatise on statistics, Theory of Probability, published
in 1939, Jeffreys asked an obvious question: just why should the dividing
line for significance be set at Fisher's value of 0.05?
This seemingly innocuous question has profound implications, for Fisher's
figure of 0.05 is still the sine qua non for deciding if a scientific
result is "significant". All scientists know that if their experiment
gives a P-value meeting Fisher's standard they are on their way to
having a publishable paper. Fisher's standard is even more important
for pharmaceutical companies, as national regulatory organisations
still use Fisher's 0.05 figure to decide whether to approve a new
drug for general release. Getting drug trial results with P-values
that beat Fisher's standard can thus make the difference between millions
in profits or bankruptcy.
So what were the brilliant insights that led Fisher to choose that talismanic
figure of 0.05, on which so much scientific research has since stood
or fallen? Incredibly, as Fisher himself admitted, there weren't any.
He simply decided on 0.05 because it was mathematically convenient.
The implications of this are truly disturbing. It means that key scientific questions
such as whether a new heart drug is seen as effective or whether diet
really is linked to cancer are being decided by an entirely arbitrary
standard chosen over 70 years ago for mathematical "convenience".
This would not matter if Fisher had been lucky, and chosen a figure that
makes the risk of being fooled by a fluke result very low. Yet statisticians
now know that his choice was a particularly bad one - and that many
supposedly "significant" findings are in fact entirely spurious.
The first hints of this deeply worrying feature of Fisher's methods emerged
as long ago as the early 1960s, following a resurgence of interest
in Bayes's Theorem. Many of the supposedly "insuperable" objections
to its use were shown to be baseless, and the theorem has since emerged
as one of the axioms of the entire theory of probability. As such,
its implications for statistics cannot be wished away - no matter
how noisome scientists might find them.
The most important of those implications is that - as Bayes himself had
insisted 200 years ago - it is indeed impossible to judge the "significance"
of data in isolation. Crucially, the plausibility of the data has
to be taken into account.
Armed with Bayes's Theorem, a number of leading statisticians began to probe
the reliability of P-values as a measure of significance. What they
discovered could hardly be more serious.
On the face of it, Fisher's standard of 0.05 suggests that the chances of
a mere fluke being the real explanation for a given result are just
5 in 100 - plenty of protection against being fooled. But in 1963,
a team of statisticians at the University of Michigan showed that
the actual chances of being fooled could easily be 10 times higher.
Because it fails to take into account plausibility, Fisher's test
can see "significance" in results which are actually over 50 percent
likely to be utter nonsense.
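The arithmetic behind that claim is easy to reproduce. The sketch below,
in Python with invented round numbers (not figures from the Michigan
study), counts what happens when most hypotheses put to the test are false:

    # Imagine 1000 hypotheses tested at the P < 0.05 level, of which
    # only 50 describe real effects, with 80% power to detect each one.

    n_tested = 1000
    n_real = 50        # invented: most tested hypotheses are false
    power = 0.80       # chance a real effect achieves P < 0.05
    alpha = 0.05       # chance a fluke achieves P < 0.05

    true_hits = n_real * power                 # 40 genuine findings
    false_hits = (n_tested - n_real) * alpha   # 47.5 significant flukes

    spurious = false_hits / (true_hits + false_hits)
    print(f"{spurious:.0%} of 'significant' results are flukes")  # 54%

The 5 percent promised by "P < 0.05" applies only to each individual
fluke; once the implausibility of most hypotheses is factored in, over
half of the "significant" findings are nonsense.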
The Michigan team - which included Professor Leonard Savage, one of the most distinguished
experts on probability of modern times - warned researchers that Fisher's
little recipe was "startlingly prone" to seeing significance in fluke results.
Despite being published in the prestigious Psychological Review, it was a
warning that went unheeded. Over the next 30 years, other statisticians
also tried to sound the alarm bell, again without success. During
the 1980s, Professor James Berger of Purdue University - a world
authority on Bayes's Theorem - published an entire series of papers
again warning of the "astonishing" tendency of Fisher's P-values to
exaggerate significance. Findings that met the 0.05 standard, said
Berger, "can actually arise when the data provide very little or no
evidence in favour of an effect". Again, the warnings were ignored.
Then one scientist decided to take direct action against the failings of
Fisher's methods. Professor Kenneth Rothman of the University of Massachusetts,
editor of the well-respected American Journal of Public Health, told
all researchers wanting to publish in the journal that he would no
longer accept results based on P-values.
It was a simple move that had a dramatic effect: the teaching in America's
leading public health schools was transformed, with statistics courses
revised to train students in alternatives to P-values. But two years
later, when Rothman stepped down from the editorship, his ban on P-values
was dropped - and researchers went back to their old ways.
It has been a similar story in Britain. In 1995, the British Psychological
Society and its counterpart in America quietly set up a working party
to consider introducing a ban on P-values in their journals. The following
year, it was disbanded - having made no decision. "It just sort of
petered out", said one insider. "The view was that it would cause
too much upheaval for the journals."
British medical journals have also examined the idea of banning P-values,
but they too have pulled back. Instead, they merely suggest that researchers
use other means of measuring significance. Yet these alternative methods
are known to suffer similar flaws to P-values, exaggerating both the
size of implausible effects and their significance.
More than 30 years after the first warnings were sounded, it has become
clear that the scientific community has no intention of dealing with
the flaws in significance tests. Yet the evidence of those flaws is
everywhere to be seen: flaky claims of health risks from a host of
implausible causes, "wonder drugs" that lose their amazing abilities
outside clinical trials, bizarre "links" between genetics and personality.
A telling feature of the excuses given for the lack of action is that they centre
on issues like "upheaval for our journals" and the "radical changes"
needed in the training of scientists. Curiously for a profession supposedly
dedicated to discovering truths, issues such as "reliability of research
conclusions" are never mentioned.
It is hard to avoid the conclusion that the real explanation for all the
foot-dragging is not scientific at all. It is simply that if scientists
abandon significance tests like P-values, many of their claims would
be seen for what they really are: meaningless flukes on which tax-payers'
money should never have been spent.
The fact is that in 1925, Ronald Fisher gave scientists a mathematical
machine for turning baloney into breakthroughs, and flukes into funding.
It is time to pull the plug.
Robert Matthews' full account of the issues raised in this article,
"Facts versus Factions: the use and abuse of subjectivity in scientific
research", was at one time available from the European Science and
Environment Forum, 4 Church Lane, Barton, Cambridge CB3 7BE, price
£3.50 (UK), £3.75 (Europe), though it may no longer be available from
this source.