When everyone’s super… On gaming the system

Syndrome: Oh, I’m real. Real enough to defeat you! And I did it without your precious gifts, your oh-so-special powers. I’ll give them heroics. I’ll give them the most spectacular heroics the world has ever seen! And when I’m old and I’ve had my fun, I’ll sell my inventions so that everyone can have powers. Everyone can be super! And when everyone’s super… [chuckles evilly] no one will be.

The Incredibles

Here’s a funny little story about how a highly specialised journal gamed journal impact measurements:

 

The Swiss journal Folia Phoniatrica et Logopaedica has a good reputation among voice researchers but, with an impact factor of 0.655 in 2007, publication in it was unlikely to bring honour or grant money to the authors’ institutions.
Now two investigators, one Dutch and one Czech, have taken on the system and fought back. They published a paper called ‘Reaction of Folia Phoniatrica et Logopaedica on the current trend of impact factor measures’ (H. K. Schutte and J. G. Švec Folia Phoniatr. Logo. 59, 281–285; 2007). This cited all the papers published in the journal in the previous two years. As ‘impact factor’ is defined as the number of citations to articles in a journal in the past two years, divided by the total number of papers published in that journal over the same period, their strategy dramatically increased Folia’s impact factor this year to 1.439.

In the ‘rehabilitation’ category, shared with 26 other journals, Folia jumped from position 22 to position 13.

Tomáš Opatrný. Playing the system to give low-impact journal more clout. Nature 455, 167 (11 September 2008).
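
To see why the trick works, here is a minimal sketch of the arithmetic, using the simplified definition given in the quotation. The paper and citation counts are hypothetical, chosen only for illustration; they are not Folia’s actual figures, and the function names are my own.

```python
def impact_factor(citations: int, papers: int) -> float:
    """Citations to a journal's recent papers divided by the number of those papers."""
    if papers <= 0:
        raise ValueError("papers must be positive")
    return citations / papers


def with_blanket_citation(citations: int, papers: int) -> float:
    """Impact factor after one extra article cites every counted paper once."""
    return impact_factor(citations + papers, papers)


if __name__ == "__main__":
    # Hypothetical journal: 100 papers in the window, 65 citations to them.
    papers, citations = 100, 65
    print(f"before: {impact_factor(citations, papers):.3f}")
    print(f"after:  {with_blanket_citation(citations, papers):.3f}")
```

Whatever the starting numbers, one article that cites every paper in the window adds exactly one citation per paper, so under this simplified definition it raises the figure by 1.0. The real calculation differs in detail (the window of counted papers shifts each year, for instance), so an actual journal’s rise need not match this exactly.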

Assessing impact

Assessing (and hence demonstrating) impact is a difficult but important problem in contemporary academia.

For most of the last century, university researchers have been evaluated on their ability to “write something and get it into print… ‘publish or perish’” (as Logan Wilson put it as early as 1942 in The Academic Man: A Study in the Sociology of a Profession, one of the first print citations of the term).

As you might expect, the development of a reward system built on publication led to a general increase in the number of publications. Studies of science publication suggest a growth rate in the number of scientific articles and journals of between 2 and 5% per year since 1907 (at the upper end of that range, a rate that leads to doubling roughly every 15 years). There is also evidence for a particularly marked rise in numbers after the 1950s.
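
The doubling figure is a matter of compound-growth arithmetic. As a rough check, assuming a constant annual growth rate (a simplification, and the function name is my own), this sketch computes how many years a quantity takes to double:

```python
import math


def doubling_time(annual_rate: float) -> float:
    """Years for a quantity growing at a constant annual rate to double."""
    if annual_rate <= 0:
        raise ValueError("annual_rate must be positive")
    # Solve (1 + r) ** t == 2 for t.
    return math.log(2) / math.log(1 + annual_rate)


if __name__ == "__main__":
    for rate in (0.02, 0.05):
        print(f"{rate:.0%} per year: doubles in about {doubling_time(rate):.1f} years")
```

On these assumptions, growth near 5% a year doubles the literature in roughly 14 years, while growth near 2% takes about 35.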

This kind of growth vitiates the original point of the metric. If everybody publishes all the time, then the simple fact of publication is no longer sufficient as a proxy for excellence. You could count the sheer number of publications—a measure that is in fact widely used in popular contexts to imply productivity—were it not so obviously open to abuse: unless you institute some kind of control over the type and quality of publication, a system that simply counts publications will lead inevitably to an increase in number, and a corresponding decrease in quality, originality, and length.

Origins of peer review

It is perhaps for this reason that modern peer review systems began to be institutionalised in the course of the second half of the last century. While peer review is probably understood to be the sine qua non of university research, and while it is possible to trace sporadic examples of activity resembling peer review back into the classical period, peer review in its modern form really only began to take shape in the period from the 1940s to the 1970s. Major scientific journals, including Science and The Journal of the American Medical Association, for example, began to make systematic use of external reviewers only in the 1940s, partially as an apparent response to the growing number and specialisation of submissions.

As you might expect, the peer review/reward system has itself been gamed. In the same way a reward system built on counting publications leads inevitably to an increase in the number of publications, a reward system built on counting peer-reviewed publications leads, inevitably, to an increase in the number of peer-reviewed publications… and the size and number of the journals that publish them.

Impact measures

Journal impact measurements are a controversial response to the not-surprising fact that peer review has also become an insufficient proxy for excellence. It is still relatively early days in this area (though less so in the natural sciences) and there is as yet no complete consensus as to how impact should be quantified. As a result, the measures can still take many forms, from lists of ranked journals, to citation counts, to circulation and aggregation statistics, to (in the case of on-line journals) even more difficult-to-interpret statistics like bounce and exit rates.

Regardless of how the impact factor debate settles out, however, it is only a matter of time until it too is gamed. Indeed, as the example of Folia Phoniatrica et Logopaedica suggests, it may not even be a matter of time. If you count citations, researchers will start ensuring they get cited. If you rank journals, they will ensure their journals fit your ranking criteria. If you privilege aggregation, the aggregators will be flooded with candidates for aggregation. And it is not clear that commercial understandings of good web analytics are really appropriate for scholarly and scientific publishing.

Is gaming the system wrong?

But the Folia Phoniatrica et Logopaedica example is also interesting because I’m not sure it is a bad thing. I can’t independently assess Opatrný’s claim that the journal is well respected though faring badly in impact measurements, but it wouldn’t surprise me if he was right. And the fact that two researchers in a single article were able to more than double their journal’s impact score simply by citing every paper published in the journal in the previous two years leaves me… quite happy for them. I doubt there are many people who would consider the article cited by Opatrný to be in some way fraudulent. Instead, I suspect most of us consider it evidence (at best) that there are still some bugs in the system and (at worst) of a successful reductio ad absurdum–similar in a certain sense to Alan Sokal’s submission to Social Text.

How impact measures improve things

None of this means that impact metrics are an intrinsically bad thing. Or that peer review isn’t good. Or that researchers shouldn’t be expected to publish. In fact, in many ways, the introduction of these various metrics, and the emphasis they receive in academia, is very good. Peer review has become almost fully institutionalised in the humanities in the course of my career. When I was a graduate student in the early 1990s, most journals I submitted to did not have a formal explanation of their review policies and many were probably not, strictly speaking, peer reviewed. But it was difficult to tell, and nobody I knew even attempted to distinguish publications on their CVs on the basis of whether or not they were peer reviewed. We were taught to distinguish publications (and the primary metric was still number of publications) on the basis of genre: you separated reviews from encyclopedia entries from notes from lengthy articles. A review didn’t count for much, even if we could have shown it was peer reviewed, and a lengthy article in what “everybody knew” to be a top journal counted for a lot, whether it was peer reviewed or not.

By the time I was department chair, 10 years later, faculty members were presenting me with CVs that distinguished output on the basis of peer review status. In these cases, genre was less important than peer review status. Reviews that were peer-reviewed were listed above articles that weren’t, and journals began being quite explicit about their reviewing policies. The journal I helped found, Digital Medievalist, began from its first issue with what we described as “ostentatious peer review”: we named the referees who recommended acceptance on every article, partially as a way of borrowing their prestige for what we thought was, at the time, a fairly daring experiment in open access publication.

But we did this also because we thought (and think) that peer review is a good thing. My peer reviewed articles are, in almost every case, without a doubt better written and especially better and more carefully argued than my non-peer-reviewed articles. I’ve had stupid comments from referees (though none as stupid as seems to be the norm on grant applications), but there is only one case I can think of where I really couldn’t see how satisfying what the referee wanted would improve things.

And the same is true for publication frequency. On the whole, my experience is that people who publish more (within a given discipline) also tend to publish better. I don’t publish too badly for somebody in my discipline. But most of the people who publish more than me in that same discipline are people I’d like to emulate. It is possible to game publication frequency; but on the whole, even the people who (I think) game it are among our most productive and most interesting scholars anyway: they’d still be interesting and productive even if they weren’t good at spinning material for one article into three.

So what does it mean that Schutte and Švec were able to game the impact measure of their journal with such apparent ease? And what should we say in response to the great uproar (much of it in my view well-founded) about the introduction of journal ranking lists by the ESF and the Australian government in recent years? Obviously some journals simply are better than others–more prestigious, better edited, more influential, containing more important papers. And it is difficult to see how frequency of citation is a bad thing, even if its absence is not necessarily evidence that something is not good or not important. I would still rather have a heavily cited article in the PMLA than an article nobody read in a journal nobody has ever heard of.

More ‘guidelines’ than ‘rules’

Perhaps the most important thing is that it suggests, as Barbossa says to Miss Turner in Pirates of the Caribbean concerning the “Pirates’ Code,” that these kinds of metrics should really be considered “more what you’d call ‘guidelines’ than actual rules.” Journals (and articles) that have a high impact factor, attract lots of citations, and are heavily read are probably to be celebrated. But impact, citations, and subscription are not in themselves sufficient proxies for quality: we should expect equally good articles, journals, and scholars to exist with lower numbers in all these areas. And more importantly, we should expect to find that any quantifiable criteria we do establish will almost immediately be gamed by researchers in the field: most people with PhD-level research positions got where they are, after all, because they were pretty good at producing what examiners wanted to hear.

The real issue, then, is that metrics like “impact” or “peer review” or even “quantity” are attempts to use quantitative values as substitutes for qualitative assessment. The only real way of assessing quality is through qualitative assessment: that is to say by assessing a work on its own merits in relation to the goals it sets itself in terms of audience, impact, and subject matter, including the reasonableness of these goals. An article by an author who is not famous, in an obscure field, in an on-line journal that has no subscribers, and that is not frequently cited may or may not represent poor quality work–in much the same way as might a frequently cited article in a popular field, written by a famous academic and published in the journal of the main scholarly society in a discipline. What is (or should be) important to the assessor is how reasonably each author has defined his or her goals and how well the resulting work has done in relation to those goals.

And this is where academics’ ability to game any other system becomes a virtue. Since there is no single metric we can create that researchers as a group will not figure out how to exploit (and in short order), we should accept that we will simply never be able to propose a quantitative measurement for assessing intrinsic quality. What we can rely on, however, is that researchers will, on the whole, try to present their work in its best light. By asking the researchers to explain how their work can be best assessed, and being willing to evaluate both that explanation and the degree to which the work meets the proposed criteria, we can find a way of comparatively evaluating excellence. Journals, articles, and researchers that define, then meet or exceed, reasonable targets for their disciplines and types of work are excellent. Those that don’t, aren’t.

And in the meantime, we’ll develop far more innovative measurements of quality.

