This post has been prompted by an edited collection that I was recently asked to review. Substantive comments on the book will be published elsewhere, so you may want to watch this space for updates; but what I want to do in this post, instead, is share some thoughts regarding research methods and the role of reviewers.
Specifically, what sparked my interest was one study in the collection, which used Likert scales to record participants’ attitudes towards a certain educational construct. Those who are not familiar with the fascinating minutiae of quantitative research can find a discussion of Likert scaling and ordinal data in the section that immediately follows. Those of you who are unlucky enough to have studied statistics may want to skip to the next section.
Likert scales and ordinal data
A Likert-type question (or ‘item’) asks respondents to select one of several (usually four or five) responses that are ranked in order of strength. Here’s an example:
Rate your agreement (or lack thereof) with the following statements using the scale below:

1. Strongly agree
2. Agree
3. Neither agree nor disagree
4. Disagree
5. Strongly disagree

- Apples are rubbish
- Yoghurt is my favourite food
- Beans are evil
- Fish fingers and custard taste great
Sometimes sets of similar items are dispersed in the same questionnaire in order to probe different aspects of the same construct. When these items are put together, the combined findings can give us information about an underlying quality or belief. I will not go into this in more detail here (but if you feel so inclined, do have a go at finding what the underlying construct is in the example above!)
Likert items and scales produce what we call ordinal data, i.e., data that can be ranked. For instance, people who select answer (1) in the last item above like fish fingers and custard more than people who choose (2), (3), (4) and (5). People who choose (2) like this snack more than those who choose (3), (4) and (5) and so on. In addition to being ranked, ordinal data can be tallied: for example, I might want to count how many people chose each of the responses and compare their numbers. This, however, is almost the extent of what one can do with such data.
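To make the 'ranked and tallied' point concrete, here is a minimal sketch using Python's standard library. The responses are hypothetical, coded to match the example above (1 = strongly agree, 5 = strongly disagree):

```python
from collections import Counter

# Hypothetical responses to the 'fish fingers and custard' item,
# coded 1 = strongly agree ... 5 = strongly disagree.
responses = [1, 2, 3, 3, 3, 4, 4, 5, 5, 5]

# Ordinal data can be tallied: count how many respondents
# chose each option, and compare the counts.
tally = Counter(responses)
for code in sorted(tally):
    print(code, tally[code])
```

Ranking and counting like this is legitimate; as the rest of this section argues, arithmetic on the codes themselves is not.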
The problem with Likert items is that many researchers, including the authors whose paper prompted this post, tend to use them to do things that they were never designed to do. Calculating average scores is one of them, and here's why it's wrong:
Imagine that ten survey participants were asked about their attitudes towards fish fingers and custard. The table below shows a hypothetical distribution of answers:
| Response | Respondents | % |
| --- | --- | --- |
| Strongly agree | 1 | 10 |
| Agree | 1 | 10 |
| Neither agree nor disagree | 3 | 30 |
| Disagree | 2 | 20 |
| Strongly disagree | 3 | 30 |
If a researcher were interested in knowing the beliefs of a ‘typical person’ (whatever that might be), they might be tempted to calculate a mean score for this data. The formula one uses to calculate means is:
[(number of people who selected response 1)*(weighting of response 1) + (number of people who selected response 2)*(weighting of response 2) + … + (number of people who selected response n)*(weighting of response n)] / (total number of respondents)
In the example above, this would yield:
[(1*1) + (1*2) + (3*3) + (2*4) + (3*5)] / 10 = 35/10 = 3.5
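For the sceptical, the arithmetic can be checked in a couple of lines of Python (the counts are the hypothetical distribution from the table above):

```python
# Hypothetical distribution: response code -> number of respondents.
counts = {1: 1, 2: 1, 3: 3, 4: 2, 5: 3}

total = sum(counts.values())  # 10 respondents
# Weighted mean: each response code multiplied by its count.
mean = sum(code * n for code, n in counts.items()) / total
print(mean)  # 3.5
```

The calculation is mechanically correct, which is exactly the trap: the software will happily produce a number whether or not the number means anything.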
Going back to the descriptors, the researcher would then ascertain that a score of 3.5 falls somewhere between 'neither agree nor disagree' and 'disagree'. They would therefore pronounce something along the lines of: 'Our study revealed mild disagreement regarding the palatability of fish fingers and custard (M=3.5)'.
Plainly put, this is statistical nonsense, not merely a suboptimal interpretation. Such an argument relies on the assumption that the psychological distance between 'strong agreement' and 'agreement' is the same as that between 'agreement' and 'no opinion'. Similarly, it implies that the distance between 'agreement' and 'strong disagreement' is three times as great as that between 'agreement' and 'strong agreement'. The mathematical model needs these assumptions in order to work, but nothing in the questionnaire design licenses them. And even if it did, they would constitute a gross distortion of psychological attitudes and the social world, made simply to fit statistical moulds.
To put it in the simplest terms possible: Ordinal data cannot yield mean values. Period.
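What ordinal data can legitimately yield are the median and the mode, which rely only on ranking and counting. Here is a minimal sketch with Python's standard library, using the same hypothetical responses as above; `median_low` is chosen deliberately so that the result is an actual response category rather than an interpolated value:

```python
import statistics

# Hypothetical responses, coded 1 = strongly agree ... 5 = strongly disagree.
responses = [1, 2, 3, 3, 3, 4, 4, 5, 5, 5]

# median_low returns the lower of the two middle values for an even-sized
# sample, so the result is always a category that exists on the scale.
print(statistics.median_low(responses))  # 3

# multimode reports every most-frequent category (here there is a tie).
print(statistics.multimode(responses))  # [3, 5]
```

Note that the tied modes (3 and 5) reveal a split in the sample that the single mean value of 3.5 completely conceals.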
Update: If you came to this page looking for information on Likert scales, you may find the following posts useful: Things you don’t know about Likert scales, On Likert scales, levels of measurement and nuanced understandings and How to interpret ordinal data. I also recommend reading this post by Stephen Westland, Professor of Colour Science and Technology at the University of Leeds, for a more nuanced understanding of Likert scaling, and an excellent discussion of how to analyse the data that these scales produce.
On the review process
As I wrote at the beginning of this post, one of the papers in the volume that I reviewed reported on findings that had been generated by extracting mean values from Likert questions, i.e., by subjecting ordinal data to a type of analysis that they don’t support. In the authors’ defence, they were neither the first nor the last to engage in this controversial practice: averaging ordinal data is as widespread as it is wrong. Unfortunately, this problem had gone unnoticed by the editors of the collection, and by the peer-reviewers employed by the press. As the book had already been published, I was left wondering whether there was anything to be gained by flagging it at this stage.
It is the nature of the peer-review process that papers are often reviewed by people who can make intelligent substantive judgements on the findings, but may not always have the requisite background to comment on the research process. For better or for worse, research methods are too diverse and too specialized for reviewers to have more than a passing acquaintance with most of them. In addition, there are limits to the time one can reasonably spend providing unpaid service to the profession, and these often preclude reading up on research methodology every time one comes across a novel research design. Now and again, reviewers have to take it on faith that the people who conducted a study knew what they were doing, and they must trust that there are no major flaws in the methods. So, rather than double checking on such matters, we tend to focus our feedback on more substantive aspects of the research (e.g., Are the claims made commensurate to the scope of the study? Do the findings add significantly to the existing body of knowledge?). Mistakes in the methodology will, on occasion, slip by.
So what is one to do when one has to provide an informed opinion on the quality of a study that has a major flaw, bearing in mind that the people responsible for finding this flaw failed to spot it, or deemed it unimportant? In this case, I decided to let it pass: the findings of the study were inconclusive and broadly consistent with what was already known about the phenomenon in question. I thought that there was little harm in having in the literature one more voice that added a weak agreement to the prevailing views – even if this voice was not informed by very sound empirical evidence.
If there is a take-home message in all this, it is that readers should once again be cautioned against putting too much faith in the published literature: just because something has appeared in print does not mean it is right.
Image Credit: The Leaf Project @ Flickr | CC BY-SA 2.0