There seemed no
doubt about it: if you were going to have a heart attack, there was never a better
time than the early 1990s. Your chances of survival appeared to be better than
ever. Leading medical journals were reporting results from new ways of treating
heart attack victims whose impact on death rates wasn't just good - it was amazing.
In 1992,
trials in Scotland of a clot-busting drug called anistreplase suggested that it
could double the chances of survival. A year later, another "miracle cure" emerged:
injections of magnesium, which studies suggested could also double survival rates.
Leading cardiologists hailed the injections as an "effective, safe, simple and
inexpensive" treatment that could save the lives of thousands.
But
then something odd began to happen. In 1995, the Lancet published the results
of a huge international study of heart attack survival rates among 58,000 patients
- and the amazing life-saving abilities of magnesium injections had simply vanished.
Anistreplase fared little better: the current view is that its real effectiveness
is barely half that suggested by the original trial.
In
the long war against Britain's single biggest killer, a few disappointments are
obviously inevitable. And over the last decade or so, scientists have identified
other heart attack treatments which in trials reduced mortality by up to 30 percent.
But again,
something odd seems to be happening. Once these drugs get out of clinical trials
and onto the wards, they too seem to lose their amazing abilities.
Last year, Dr Nigel
Brown and colleagues at Queen's Medical Centre in Nottingham published a comparison
of death rates among heart attack patients for 1989-1992 and those back in the
clinical "Dark Ages" of 1982-4, before such miracles as thrombolytic therapy had
shown success in trials. Their aim was to answer a simple question: just what
impact have these "clinically proven" treatments had on death rates out on the
wards?
Judging
by the trial results, the wonder treatments should have led to death rates on
the wards of just 10 percent or so. What Dr Brown and his colleagues actually
found was, to put it mildly, disconcerting. Out on the wards, the wonder drugs
seem to be having no effect at all. In 1982, the death rate among patients admitted
with heart attacks was about 20 percent. Ten years on, it was the same: 20 percent
- double the death rate predicted by the clinical trials.
In
the search for explanations, Dr Brown and his colleagues pointed to the differences
between patients in clinical trials - who tend to be hand-picked and fussed over
by leading experts - and the ordinary punter who ends up in hospital wards. They
also suggested that delays in patients arriving on the wards might be preventing
the wonder drugs from showing their true value.
All
of which would seem perfectly reasonable - except that heart attack therapies
are not the only "breakthroughs" that are proving to be damp squibs out in the
real world.
Over
the years, cancer experts have seen a host of promising drugs dismally fail once
outside clinical trials. In 1986, an analysis of cancer death rates in the New
England Journal of Medicine concluded that "Some 35 years of intense effort
focused largely on improving treatment must be judged a qualified failure". Last
year, the same journal carried an update: "With 12 more years of data and experience",
the authors said, "We see little reason to change that conclusion".
Scientists investigating
supposed links between ill-health and various "risk factors" have seen the same
thing: impressive evidence of a "significant" risk - which then vanishes again
when others try to confirm its existence. Leukaemias and overhead pylons, connective
tissue disease and silicone breast implants, salt and high blood pressure: all
have an impressive heap of studies pointing to a significant risk - and an equally
impressive heap saying it isn't there.
It
is the same story beyond the medical sciences, in fields from psychology to genetics:
amazing results discovered by reputable research groups which then vanish again
when others try to replicate them.
Much
effort has been spent trying to explain these mysterious cases of The Vanishing
Breakthrough. Over-reliance on data from tiny samples, the reluctance of journals
to print negative findings from early studies, outright cheating: all have been
put forward as possible suspects.
Yet
the most likely culprit has long been known to statisticians. A clue to its identity
comes from the one feature all of these scientific disciplines have in common:
they all rely on so-called "significance tests" to gauge the importance of their
findings.
First
developed in the 1920s, these tests are routinely used throughout the scientific
community. Thousands of scientific papers and millions of pounds of research funding
have been based on their conclusions. They are ubiquitous and easy to use. And
they are fundamentally and dangerously flawed.
Used
to analyse clinical trials, these textbook techniques can easily double the apparent
effectiveness of a new drug and turn a borderline result into a highly "significant"
breakthrough. They can throw up convincing yet utterly spurious evidence for "links"
between diseases and any number of supposed causes. They can even lend impressive
support to claims for the existence of the paranormal.
The
very suggestion that these basic flaws in such widely-used techniques could have
been missed for so long is astonishing. Altogether more astonishing, however,
is the fact that the scientific community has been repeatedly warned about these
flaws - and has ignored the warnings.
As
a result, thousands of research papers are being published every year whose conclusions
are based on techniques known to be unreliable. The time and effort - and public
money - wasted in trying to confirm the consequent spurious findings is one of
the great scientific scandals of our time.
The
roots of this scandal are deep, having their origins in the work of an English
mathematician and cleric named Thomas Bayes, published over 200 years ago. In
his "Essay Towards Solving a Problem in the Doctrine of Chances", Bayes gave a
mathematical recipe of astonishing power. Put simply, it shows how we should change
our belief in a theory in the light of new evidence.
One
does not need to be a statistician to see the fundamental importance of "Bayes's
Theorem" for scientific research. From studies of the cosmos to trials of cancer
drugs, all research is ultimately about finding out how we should change our belief
in a theory as new data emerge.
For
over 150 years, Bayes's Theorem formed the foundation of statistical science,
allowing researchers to assess the meaning of new results. But during the early
part of the 20th century, a number of influential mathematicians and philosophers
began to raise objections to Bayes's Theorem. The most damning was also the simplest:
different people could use Bayes's Theorem and get different results.
Faced with the
same experimental evidence for, say, ESP, true believers could use Bayes's Theorem
to claim that the new results implied that telepathy is almost certainly real.
Skeptics, in contrast, could use Bayes's Theorem to insist they were still not
convinced.
Both
views are possible because Bayes's Theorem shows only how to alter one's prior
level of belief - and different people can start out with different opinions.
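A minimal sketch in Python may make the point concrete. The numbers are purely illustrative assumptions, not figures from any real ESP experiment: the data are taken to be five times more likely if telepathy is real than if it is not, and the function name `posterior` is simply a label chosen for this example.

```python
def posterior(prior, p_data_if_true, p_data_if_false):
    """Bayes's Theorem: belief in a theory after seeing the data."""
    numerator = prior * p_data_if_true
    return numerator / (numerator + (1.0 - prior) * p_data_if_false)


# Assumed for illustration: the same evidence is shown to both parties,
# and it is five times more likely if ESP is real than if it is not.
P_DATA_IF_REAL = 0.5
P_DATA_IF_NOT = 0.1

# Assumed starting opinions: a true believer and a hardened skeptic.
believer = posterior(0.9, P_DATA_IF_REAL, P_DATA_IF_NOT)
skeptic = posterior(0.001, P_DATA_IF_REAL, P_DATA_IF_NOT)

print(f"believer's belief after the data: {believer:.3f}")  # 0.978
print(f"skeptic's belief after the data:  {skeptic:.3f}")   # 0.005
```

Both parties have used the same recipe on the same evidence, and both have shifted towards belief by the same factor. They still end up far apart, because they started far apart.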
To non-scientists,
this may not seem like an egregious failing at all: what one person sees as convincing
evidence may obviously fail to impress others. No matter: the fact that Bayes's
Theorem could lead different people to different conclusions led to its being
inextricably linked to the most rebarbative concept known to scientists: subjectivity.
It is hard
to convey the emotions roused within the scientific community by the S-word. Subjectivity
is seen as the barbarian at the gates of science, the enemy of objective truth,
the destroyer of insight. It is seen as the mind-virus that has turned the humanities
into an intellectual free-for-all, where the idea of "progress" is dismissed as
bourgeois, and the belief in "facts" naïve. Once allowed into the citadel
of science, runs the argument, subjectivity would turn all research into glorified
literary criticism.
By
the 1920s, Bayes's Theorem had all but been declared heretical - which created
a problem: what were scientists going to replace it with? The answer came from
one of Bayes's most brilliant critics: the Cambridge mathematician and geneticist,
Ronald Aylmer Fisher - father of modern statistics.
Few
scientists had greater need of a replacement for Bayes than Fisher, who frequently
worked with complex data from plant breeding trials. Drawing on his great mathematical
ability, he set about finding a new and completely objective way of drawing conclusions
from experiments. By 1925, he believed he had succeeded, and published his techniques
in a book, "Statistical Methods for Research Workers". It was to become one of
the most influential texts in the history of science, and laid the foundations
for virtually all the statistics now used by scientists.
On
the face of it, Fisher had achieved what Bayes claimed was impossible: he had
found a way of judging the "significance" of experimental data entirely objectively.
That is, he had found a way that anyone could use to show that a result was too
impressive to be dismissed as a fluke.
All
scientists had to do, said Fisher, was to convert their raw data into something
called a P-value, a number giving the probability of getting, by chance alone,
results at least as impressive as those seen. If this P-value was below 1 in 20, or 0.05,
said Fisher, it was safe to conclude that a finding really was "significant".
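As a rough illustration of what such a P-value is, the Python sketch below works one out for a hypothetical guessing experiment in which pure chance would give a correct answer half the time. The experiment, the numbers, and the function name `p_value_at_least` are assumptions made for this example. It uses a simple one-sided exact binomial calculation, which is only one of many tests in the Fisher tradition.

```python
from math import comb


def p_value_at_least(successes, trials, chance=0.5):
    """Probability of at least `successes` hits in `trials` by chance alone."""
    return sum(
        comb(trials, k) * chance**k * (1.0 - chance) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# Hypothetical subject A gets 15 of 20 guesses right; subject B gets 13 of 20.
p_a = p_value_at_least(15, 20)
p_b = p_value_at_least(13, 20)

print(f"15 of 20: P = {p_a:.4f}, significant: {p_a < 0.05}")  # P = 0.0207, True
print(f"13 of 20: P = {p_b:.4f}, significant: {p_b < 0.05}")  # P = 0.1316, False
```

Note what the calculation leaves out: nothing in it asks how plausible the claimed effect was in the first place.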
Combining
simplicity with apparent objectivity, Fisher's P-value method was an immediate
hit with the scientific community. Its popularity endures to this day. Open any
leading scientific journal and you will see the phrase "P < 0.05" - the hallmark
of a significant finding - in papers on every conceivable area of research, from
astronomy to zoology. Every year, new statistics textbooks appear to explain Fisher's
simple little recipe to a new generation of researchers.
But
just as scientists were adopting P-values, a few awkward questions started to
be asked by other statisticians. The most telling was raised by the distinguished
Cambridge mathematician Harold Jeffreys. Writing in his own treatise on statistics,
Theory of Probability, published in 1939, Jeffreys asked an obvious question:
just why should the dividing line for significance be set at Fisher's value of
0.05?
This
seemingly innocuous question has profound implications, for Fisher's figure of
0.05 is still the sine qua non for deciding if a scientific result is "significant".
All scientists know that if their experiment gives a P-value meeting Fisher's
standard they are on their way to having a publishable paper. Fisher's standard
is even more important for pharmaceutical companies, as national regulatory organisations
still use Fisher's 0.05 figure to decide whether to approve a new drug for general
release. Getting drug trial results with P-values that beat Fisher's standard
can thus make the difference between millions in profits or bankruptcy.
So just
what were the brilliant insights that led Fisher to choose that talismanic figure
of 0.05, on which so much scientific research has since stood or fallen? Incredibly,
as Fisher himself admitted, there weren't any. He simply decided on 0.05 because
it was mathematically convenient.
The
implications of this are truly disturbing. It means that key scientific questions
such as whether a new heart drug is seen as effective or whether diet really is
linked to cancer are being decided by an entirely arbitrary standard chosen over
70 years ago for mathematical "convenience".
This
would not matter if Fisher had been lucky, and chosen a figure that makes the
risk of being fooled by a fluke result very low. Yet statisticians now know that
his choice was a particularly bad one - and that many supposedly "significant"
findings are in fact entirely spurious.
The
first hints of this deeply worrying feature of Fisher's methods emerged
as long ago as the early 1960s, following a resurgence of interest in Bayes's
Theorem. Many of the supposedly "insuperable" objections to its use were shown
to be baseless, and the theorem has since emerged as one of the axioms of the
entire theory of probability. As such, its implications for statistics cannot
be wished away - no matter how noisome scientists might find them.
And the most important
of those implications is that - as Bayes himself had insisted 200 years ago -
it is indeed impossible to judge the "significance" of data in isolation. Crucially,
the plausibility of the data has to be taken into account.
Using
Bayes's Theorem, a number of leading statisticians began to probe the reliability
of P-values as a measure of significance. What they discovered could hardly be
more serious.
On
the face of it, Fisher's standard of 0.05 suggests that the chances of a mere
fluke being the real explanation for a given result are just 5 in 100 - plenty
of protection against being fooled. But in 1963, a team of statisticians at the
University of Michigan showed that the actual chances of being fooled could easily
be 10 times higher. Because it fails to take into account plausibility, Fisher's
test can see "significance" in results which are actually over 50 percent likely
to be utter nonsense.
The
team - which included Professor Leonard Savage, one of the most distinguished
experts on probability of modern times - warned researchers that Fisher's little
recipe was "startlingly prone" to seeing significance in fluke results.
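The role of plausibility can be sketched with a deliberately simplified model in Python. This is not the Michigan team's calculation, which was more subtle. It simply applies Bayes's Theorem to a batch of studies, under the assumptions that a fraction `plausibility` of the theories tested are really true and that a real effect is detected 80 percent of the time. The function name `fluke_share` and all the numbers are hypothetical.

```python
def fluke_share(plausibility, alpha=0.05, power=0.8):
    """Fraction of 'significant' results that are flukes, by Bayes's Theorem.

    plausibility: assumed prior probability that the effect is real
    alpha:        chance a non-existent effect still passes the test
    power:        assumed chance a real effect passes the test
    """
    false_alarms = alpha * (1.0 - plausibility)
    true_hits = power * plausibility
    return false_alarms / (false_alarms + true_hits)


# Assumed plausibility levels: an even bet, a long shot, a very long shot.
for plausibility in (0.5, 0.1, 0.05):
    share = fluke_share(plausibility)
    print(f"plausibility {plausibility:.2f}: {share:.0%} of significant results are flukes")
```

Under these assumptions, the share of flukes among "significant" findings rises from about 6 percent for an even-odds theory to about 36 percent for a 1-in-10 long shot and about 54 percent for a 1-in-20 one, even though every result met the 0.05 standard.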
Despite
being published in the prestigious Psychological Review, it was a warning that
went unheeded. Over the next 30 years, other statisticians also tried to sound
the alarm bell, again without success. During the 1980s, Professor James Berger
of Purdue University - a world authority on Bayes's Theorem - published an entire
series of papers again warning of the "astonishing" tendency of Fisher's P-values
to exaggerate significance. Findings that met the 0.05 standard, said Berger,
"Can actually arise when the data provide very little or no evidence in favour
of an effect". Again, the warnings were ignored.
In
1986, one scientist decided to take direct action against the failings of Fisher's
methods. Professor Kenneth Rothman of the University of Massachusetts, editor
of the well-respected American Journal of Public Health, told all researchers
wanting to publish in the journal that he would no longer accept results based
on P-values.
It
was a simple move that had a dramatic effect: the teaching in America's leading
public health schools was transformed, with statistics courses revised to train
students in alternatives to P-values. But two years later, when Rothman stepped
down from the editorship, his ban on P-values was dropped - and researchers went
back to their old ways.
It
has been a similar story in Britain. In 1995, the British Psychological Society
and its counterpart in America quietly set up a working party to consider introducing
a ban on P-values in their journals. The following year, it was disbanded - having
made no decision. "It just sort of petered out", said one insider. "The view was
that it would cause too much upheaval for the journals."
Leading
British medical journals have also examined the idea of banning P-values, but
they too have pulled back. Instead, they merely suggest that researchers use other
means of measuring significance. Yet these alternative methods are known to suffer
similar flaws to P-values, exaggerating both the size of implausible effects and
their significance.
More
than 30 years after the first warnings were sounded, it has become clear that
the scientific community has no intention of dealing with the flaws in significance
tests. Yet the evidence of those flaws is everywhere to be seen: flaky claims
of health risks from a host of implausible causes, "wonder drugs" that lose their
amazing abilities outside clinical trials, bizarre "links" between genetics and
personality.
A
striking feature of the excuses given for the lack of action is that they centre
on issues like "upheaval for our journals" and the "radical changes" needed in
the training of scientists. Curiously for a profession supposedly dedicated to
discovering truths, issues such as "reliability of research conclusions" are never
mentioned.
It
is hard to avoid the conclusion that the real explanation for all the foot-dragging
is not scientific at all. It is simply that if scientists abandoned significance
tests like P-values, many of their claims would be seen for what they really are:
meaningless flukes on which tax-payers' money should never have been spent.
The plain
fact is that in 1925, Ronald Fisher gave scientists a mathematical machine for
turning baloney into breakthroughs, and flukes into funding. It is time to pull
the plug.
Robert
Matthews gives a full account of the issues raised in this
article in "Facts versus Factions: the use and abuse of subjectivity in scientific
research." At one time the article was also available from the European
Science and Environment Forum, 4 Church Lane, Barton, Cambridge CB3 7BE,
price £3.50 (UK) £3.75 (Europe), though I don't know that it's still
available from this source.