In case you missed it, the ‘big story’ in academic news in the past week was the retraction of more than 120 papers that had been published by Springer and the Institute of Electrical and Electronics Engineers (IEEE). The retraction followed the discovery, by Dr. Cyril Labbé of Joseph Fourier University, that all the papers in question had been generated by SCIgen, a computer programme that automatically produces nonsense academic papers.
Judging by the media reaction, the public seemed incredulous that such a thing could have happened. In this post, I argue that such embarrassments are common, and that they are not even the worst problem in academic publishing.
A history of hoaxes
Fake papers have regularly appeared in the scholarly record, often in order to demonstrate problems with the peer review process. For instance, in 1994, Alan Sokal famously submitted a paper to Social Text, in which he boldly fused string theory, Lacan and Derrida, and argued that quantum gravity had profound political implications. When the article was published, he revealed the hoax in a simultaneous publication, where he explained his rationale as follows:
For some years I’ve been troubled by an apparent decline in the standards of intellectual rigor in certain precincts of the American academic humanities. But I’m a mere physicist: if I find myself unable to make head or tail of jouissance and différance, perhaps that just reflects my own inadequacy.
So, to test the prevailing intellectual standards, I decided to try a modest (though admittedly uncontrolled) experiment: Would a leading North American journal of cultural studies – whose editorial collective includes such luminaries as Fredric Jameson and Andrew Ross – publish an article liberally salted with nonsense if (a) it sounded good and (b) it flattered the editors’ ideological preconceptions?
Since then, there have been reports of numerous hoax papers intended to raise awareness of pseudo-academic practices, such as spamferences and predatory publishing. Most recently, Science reported on a massive ‘sting’ operation which used computer-generated variants of a fake paper, in an attempt to expose ‘predatory’ publishers. John Bohannon, the scientist behind the sting, reports how he created a “credible but mundane scientific paper”, which was filled with such “grave errors that a competent peer reviewer should easily identify it as flawed and unpublishable”. He then submitted versions of the paper to 304 journals; of the 255 submissions that received a final decision, no fewer than 157 were accepted.
Returning to this week’s case, in a statement issued immediately after the story went public, Springer expressed confidence in the standards of peer review that it employs. In the publisher’s words:
There will always be individuals who try to undermine existing processes in order to prove a point or to benefit personally. Unfortunately, scientific publishing is not immune to fraud and mistakes, either. The peer review system is the best system we have so far and this incident will lead to additional measures on the part of Springer to strengthen it.
The sentiment expressed by Springer may be due to the fact that all the papers that were identified seem to have been published in conference proceedings, which do not always adhere to the same standards of peer review that apply to research articles. Conference contributions are often judged on the merit of a short abstract, so that scholarly output can be rapidly disseminated. This allows academics to benefit from feedback from other conference participants, in order to develop ideas that are still rough around the edges into a ‘proper’ academic article. It is therefore possible for publishers like Springer to feel confident about the quality of their journal articles, while at the same time acknowledging the scope for more stringent processes in other areas.
Three problems at the heart of science
That said, there is reason to think that increased diligence by referees may not be enough. It seems that the SCIgen incident, the Science ‘sting’ before it, and the numerous retractions regularly reported on sites such as Retraction Watch are just symptoms of a more serious, and deepening, crisis in scholarly communication. In an article originally published in the Guardian, Curt Rice describes three aspects of this crisis, namely increased retractions, low replicability and problematic measures of research quality.
Bad science gets published too often
First, the number of articles that are retracted seems to be increasing, with most retractions appearing in the more prestigious journals. It is unclear whether more retractions are a sign of increased malpractice or closer scrutiny, but either way, it seems that a good part of the academic record is tainted, and that peer-reviewers, editors and publishers are all to blame.
As academics who were involved in fraud are increasingly made to face the consequences of their actions, I believe that similar standards of accountability should apply to everyone involved in the publishing process. For instance, the names of the referees who reviewed each article should be made available (as done by the Journal of Bone and Joint Surgery), so that we might be able to assign blame when the system breaks down. It has also been suggested that publishers should refund academics for those publications which clearly fail to meet academic standards. Whether or not such a measure would be feasible is up for debate, but it seems hard to argue against such an idea from an ethical perspective.
We can’t even be certain about ‘sound’ papers
The second problem Rice highlights pertains to replicability. Academic journals are understandably keen to publish studies which report surprising or unusual findings. However, sometimes these findings are just statistical flukes. In the social sciences, for example, the common threshold of statistical significance is p<0.05, which means that, even when there is no real effect, about one test in every 20 will produce a ‘significant’ result purely by chance. This is not, in itself, a problem, because ‘science corrects itself’, as the adage goes. Since researchers (should) report their methods in sufficient detail for their study to be replicated, if follow-up research fails to reproduce an unusual finding, then the finding can be disregarded or even retracted.
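The one-in-20 figure can be checked with a small simulation. The sketch below is only an illustration under stated assumptions: the sample size, the number of simulated studies and the choice of a simple z-test are all hypothetical, not taken from any study mentioned here. Each simulated ‘study’ measures pure noise, so any result with p<0.05 is a fluke by construction.

```python
import random
from statistics import NormalDist, fmean

ALPHA = 0.05          # conventional significance threshold
SAMPLE_SIZE = 30      # hypothetical number of observations per study
NUM_STUDIES = 10_000  # hypothetical number of simulated studies


def simulate_null_study(rng: random.Random) -> float:
    """Return the two-sided p-value of one study in which no real effect exists."""
    # Every observation is pure noise: standard normal with a true mean of zero.
    sample = [rng.gauss(0.0, 1.0) for _ in range(SAMPLE_SIZE)]
    # z-test with known unit variance: z = sample mean * sqrt(n).
    z = fmean(sample) * SAMPLE_SIZE ** 0.5
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def false_positive_rate(seed: int = 0) -> float:
    """Share of no-effect studies that still come out 'significant'."""
    rng = random.Random(seed)
    hits = sum(simulate_null_study(rng) < ALPHA for _ in range(NUM_STUDIES))
    return hits / NUM_STUDIES


if __name__ == "__main__":
    print(f"Share of 'significant' null studies: {false_positive_rate():.3f}")
```

The printed share should hover around 0.05, i.e. roughly one spurious ‘finding’ for every 20 studies of an effect that does not exist.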
In practice, there are at least three problems with this arrangement. First, a very large proportion of studies are impossible to replicate, either because they were conducted in the field rather than in controlled laboratory conditions, or because they report on complex phenomena that are sensitive to shifting conditions, or because the description of the methods is opaque. Secondly, replication studies that report negative results are notoriously hard to publish: who would want to read, let alone fund, a study reporting that nothing new was found? Although there are a few journals, like the Journal of Negative Results in Biomedicine, dedicated to publishing such studies, it does seem that a lot of science goes unchecked. Lastly, it appears that even after research findings are challenged by a replication study, belief in the discredited results persists.
To me, all this seems to suggest a need for rethinking how research should be reported. This might involve more effective screening prior to publication, perhaps by requiring independent replication before a study is published. It could also involve greater transparency, e.g., by making datasets publicly available as per the new PLoS policy. And it should almost certainly involve better training and original thinking (e.g., reporting experiments in video format!).
We don’t measure the quality of science correctly
The final problem in Rice’s list is the prominence attached to the impact factor (IF). This is a metric which shows how often, on average, articles published in a certain journal over the previous two years were cited in a given year. The impact factor was originally designed to help acquisitions librarians manage subscriptions: journals with high IFs were likely to contain more useful research, and were therefore assigned higher priority when making purchasing decisions.
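To make the arithmetic concrete, here is a minimal sketch of how such a journal-level average works. It is a simplification: the function name and the citation counts are made up for illustration, and the official calculation applies additional rules about which items count as ‘citable’.

```python
from statistics import median


def impact_factor(citations_per_article: list[int]) -> float:
    """Average citations received this year by articles from the previous two years.

    `citations_per_article` holds one (hypothetical) count per article that the
    journal published in the two preceding years.
    """
    if not citations_per_article:
        raise ValueError("the journal published no articles in the two-year window")
    return sum(citations_per_article) / len(citations_per_article)


# Hypothetical journal: ten articles, one of which attracts most of the citations.
example_counts = [40, 6, 3, 2, 1, 1, 0, 0, 0, 0]

if __name__ == "__main__":
    print(f"Impact factor: {impact_factor(example_counts):.1f}")   # 5.3
    print(f"Median citations per article: {median(example_counts)}")  # 1.0
```

In this made-up example the journal scores 5.3, yet the typical article in it was cited once, which is a reminder that the figure describes the journal as a whole rather than any one paper in it.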
There are many ways to artificially inflate a journal’s IF, but this may not be the main weakness of the system. Nor am I too concerned about the fact that papers published in high IF journals tend to be retracted more often than less prestigious publications. For me, what is particularly problematic is that the IF is also often used as a metric for indirectly assessing the quality of individual papers or researchers, enabling administrators and politicians to make judgements about research without actually needing to understand it. As Rice comments:
Politicians have a legitimate need to impose accountability, and while the ease of counting – something, anything – makes it tempting for them to infer quality from quantity, it doesn’t take much reflection to realize that this is a stillborn strategy.
The reason why this is a stillborn strategy is that it is embarrassingly easy for bad quality papers to get published in journals with high IFs, as the string of hoaxes demonstrates; and it is even more common for good quality papers to be published in journals with a less-than-stellar IF (here’s some research to prove as much). It also seems that publishing in a high IF journal does not correlate well with tenure decisions, which suggests that the metric is not a useful way of judging individual researchers, either.
A much better way to evaluate the quality of individual papers would be to count the number of times that they have been cited. While shifting emphasis from the journal to the paper seems intuitive, it would mean that we should withhold judgement on papers for maybe several years, until research influenced by them has reached publication stage.
Difficult questions, few answers
It is an often-quoted truism that complaining about problems is easy, but proposing solutions is hard. If that is true, then it would perhaps be presumptuous of me to claim that I know what needs to be done. Sadly, I don’t.
But what does seem self-evident to me is that the common underlying cause of all three problems which Rice has identified is a culture of accountability, in which academics are placed under intense pressure to demonstrate that they are engaged in useful work. The fact that such pressure is not compatible with the careful reflective process that is requisite to quality research is perhaps obvious to academics. In the words of Jean Colpaert:
How many points would Louis Pasteur, Henri Poincaré, Claude Shannon, Tim Berners-Lee and others nowadays earn within the new academic evaluation system?
It may, however, be the case that this paradox has not been effectively communicated outside the Ivory Tower: maybe we should be doing a better job explaining why, and how, outside pressures to maximize performance are inhibiting research, to the detriment of everyone involved.