Welcome! Chances are that you landed on this page looking for information on Likert scales and averages. If that is the case, you will probably want to skip directly to the part of this post where I talk about a common mistake people make with ordinal data and mean values. You should also take a look at the list of additional resources.
This post has been prompted by an edited collection that I was recently asked to review. Substantive comments on the book have been published elsewhere; but what I want to do in this post, instead, is share some thoughts regarding a common statistical mistake and a common misconception about published works.
Specifically, what sparked my interest was one study in the collection, which used Likert scales to record participants’ attitudes towards a certain educational construct. Those who are not familiar with the fascinating minutiae of quantitative research can find a discussion of Likert scaling and ordinal data in the section that immediately follows. Those of you who are unlucky enough to have studied statistics may want to skip to the next section.
Likert scales and ordinal data
What are Likert scales?
A Likert-type question (or ‘item’) asks respondents to select one of several (usually four or five) responses that are ranked in order of strength. Here’s an example:
Indicate what you think about the following statements using the scale below:
(1) Strongly Agree; (2) Agree; (3) Neither agree nor disagree; (4) Disagree; (5) Strongly Disagree
| Statement | 1 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- |
| a. Apples are rubbish | 1 | 2 | 3 | 4 | 5 |
| b. Yoghurt is my favourite food | 1 | 2 | 3 | 4 | 5 |
| c. Beans are evil | 1 | 2 | 3 | 4 | 5 |
| d. Fish fingers and custard taste great | 1 | 2 | 3 | 4 | 5 |
Each of these items measures a variable, i.e., a construct about which we want to learn more. Sometimes, sets of similar items are dispersed in the same questionnaire. This helps researchers to probe different aspects of the same construct (or ‘latent variable’), by putting together information from all the related items. I will not go into any of this in more detail here, but if you want to find out more, this post has some additional information.
Likert scales are very frequently used to measure constructs like satisfaction rates, attitudes towards different things, and more. They are very flexible and very useful, provided you use them carefully.
Interpreting Likert scales
Many researchers tend to use Likert scales to do things that they were never designed to do
Likert items and scales produce what we call ordinal data, i.e., data that can be ranked. In the example above, people who select response (1) to item (d) are more fond of fish fingers and custard than people who choose responses (2), (3), (4) and (5). People who choose response (2) like this snack more than those who choose responses (3), (4) and (5), and so on. In addition to being ranked, ordinal data can be tallied: for example, I might want to divide my sample by age group, count how many people chose each of the responses, and compare results across ages. This, however, is almost the extent of what one can do with such data.
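As a quick sketch of what tallying by group looks like in practice, here is a few lines of Python. The age groups and response values are invented purely for illustration:

```python
from collections import Counter

# Hypothetical responses to item (d), coded 1-5 as in the scale above,
# split into two made-up age groups.
responses_by_group = {
    "under 30": [1, 2, 2, 3, 5],
    "30 and over": [3, 4, 4, 5, 5],
}

# Count how many respondents in each group chose each response.
tallies = {group: Counter(answers) for group, answers in responses_by_group.items()}

for group, tally in tallies.items():
    print(group, dict(tally))
```

Counts like these can legitimately be compared across groups, which is exactly the kind of operation ordinal data supports.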
The problem with Likert items is that many researchers –including the ones whose paper prompted this post– tend to use them in order to do things that they were never designed to do. Calculating average scores is one of them, and here’s why it’s wrong:
Imagine that ten survey participants were asked about their attitudes towards fish fingers and custard. The table below shows a hypothetical distribution of answers:
| Response | Respondents | % of sample |
| --- | --- | --- |
| Strongly agree | 1 | 10 |
| Agree | 1 | 10 |
| Neither agree nor disagree | 3 | 30 |
| Disagree | 2 | 20 |
| Strongly disagree | 3 | 30 |
The wrong way to do it
If I were interested in knowing the beliefs of a ‘typical person’ (whatever that might be), then I might be tempted to calculate a mean score for this data. The ‘mean’ is the technical term for what is commonly called the ‘average’. To do this, I might use the following formula:
[(number of people who selected response 1)*(weighting of response 1) + (number of people who selected response 2)*(weighting of response 2)… (number of people who selected response n)*(weighting of response n)] / (total number of respondents)
In the example above, this would yield:
[(1*1) + (1*2) + (3*3) + (2*4) + (3*5)] / 10 = 3.5
Going back to the descriptors, I would then conclude that an ‘average’ response of 3.5 corresponds to something between ‘no opinion’ and ‘disagreement’. I might therefore pronounce something along the lines of: ‘Our study revealed mild disagreement regarding the palatability of fish fingers and custard (M = 3.5)’.
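For completeness, here is how the calculation described above (which, as the next section argues, is not an appropriate one for ordinal data) looks in Python, using the tally from the hypothetical distribution:

```python
# Tally from the hypothetical distribution above:
# response value (1-5) -> number of respondents who chose it
tally = {1: 1, 2: 1, 3: 3, 4: 2, 5: 3}

n = sum(tally.values())  # 10 respondents in total
mean = sum(value * count for value, count in tally.items()) / n

print(mean)  # 3.5
```

The arithmetic is trivially easy, which is perhaps part of the problem: nothing in the calculation itself warns you that the input values are ordinal codes rather than genuine quantities.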
A better way
Plainly put, the option suggested above is ~~statistical nonsense~~ not an optimal interpretation (update: I feel less strongly about this than I used to in 2013, but I still think it is usually wrong).
For this interpretation to be valid, I would need to make assumptions like the following:
- Firstly, I would need to assume that the psychological distance between ‘strong agreement’ and ‘agreement’ is the same as that between ‘agreement’ and ‘no opinion’.
- A corollary of the above would be that the distance between ‘agreement’ and ‘strong disagreement’ is three times greater than that between ‘agreement’ and ‘strong agreement’.
The mathematical model needs these assumptions in order to work, but they are simply not built into the questionnaire design. And even if we forced them into the questionnaire, that would constitute a gross distortion of psychological attitudes and the social world to fit our statistical mould.
Ordinal data cannot yield mean values. If you think they can, do so at your own risk.
To put it in the simplest terms possible: Ordinal data cannot yield mean values. If you think that they can (and some statistics guidance websites might encourage you to think so), you can still take your chances. But please make sure you justify your choice well when you write up your methods section.
A safer way forward, if you are interested in finding what the ‘average’ or ‘typical’ response is, is to look at the median response. The median is a type of average, like the mean, except that it is the value that sits exactly in the middle of the ordered data, i.e., half the responses rank at or below it and half at or above it. You can find out more about how to calculate the median here.
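As a sketch, the median of the worked example above can be found with Python’s standard library. I use `median_low` here (an assumption on my part, not something the post prescribes) because it always returns an actual data point, which keeps the result on the original ordinal scale:

```python
from statistics import median_low

# The ten hypothetical responses from the worked example,
# expanded from the tally (1x1, 1x2, 3x3, 2x4, 3x5).
responses = [1, 2, 3, 3, 3, 4, 4, 5, 5, 5]

# For an even-sized dataset, median_low returns the lower of the two
# middle values, so the result is always a response that actually
# appears on the scale (unlike an interpolated value such as 3.5).
typical = median_low(responses)
print(typical)  # 3, i.e. 'Neither agree nor disagree'
```

Notice that the median here is a real response category, so it can be reported in the scale’s own terms rather than as a number suspended between two descriptors.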
More to read about Likert scales
If you came to this page looking for information on Likert scales, you may find the following posts useful: Things you don’t know about Likert scales, and On Likert scales, levels of measurement and nuanced understandings. I also recommend reading this overview of Likert scales and this post by Stephen Westland (University of Leeds), for a more nuanced understanding of Likert scaling, and an excellent discussion of how to analyse the data that these scales produce.
On the peer review process
As I wrote at the beginning of this post, one of the papers in the volume that I reviewed made the statistical mistake that I just described: it reported a set of findings that had been generated by extracting mean values from Likert items. In the authors’ defence, they were neither the first nor the last to engage in this controversial practice: averaging ordinal data is as widespread as it is wrong. Unfortunately, this problem had gone unnoticed by the editors of the collection, and by the peer-reviewers employed by the press. As the book had already been published, I was left wondering whether there was anything to be gained by flagging it at this stage.
What went wrong with peer review in this instance?
Readers often take it on faith that the people who conducted a study knew what they were doing. This faith is sometimes misplaced.
It is the nature of the peer-review process that the people who review academic articles can make intelligent substantive judgements on the findings, but might not always have the requisite background to comment on the research process (or vice versa). For better or for worse, research methods are too diverse and too specialized for reviewers to have more than a passing acquaintance with most of them. In addition, there are limits to the time one can reasonably spend providing unpaid service to the profession, and these often preclude reading up on research methodology every time one comes across a novel research design.
Every now and then, reviewers have to take it on faith that the people who conducted a study knew what they were doing, and they must trust that there are no major flaws in the methods. So, rather than double checking on such matters, we tend to focus our feedback on more substantive aspects of the research (e.g., Are the claims commensurate to the scope of the study? Do the findings add significantly to the existing body of knowledge?). Mistakes in the methodology will, on occasion, slip by.
What can you do if you come across mistakes in published research?
So the question I faced was: what should I do when asked to provide an informed opinion on the quality of a study that has a major flaw? This was not made any easier by the knowledge that the people responsible for finding this flaw had failed to spot it, or deemed it unimportant. So, in this case, I decided to let it pass. Besides, the findings of this particular study were inconclusive and broadly consistent with what was already known about the phenomenon in question. I therefore thought that there was little harm in having one more voice in the literature to add some more weak agreement to the prevailing views – even if the empirical evidence that informed this voice was not very strong.
If there is a take-home message in all this, it is that, as a reader, you should not put too much faith in the published literature. Just because something has made it to the printing press does not mean that it is right.
About this post: This post was originally written in 2013 for my blog (www.achilleaskostoulas.com). For reasons I do not fully understand, it has come to rank very highly on SERPs about Likert scale measurement. It has been revised several times since, with a view to making it more useful to readers who are looking for statistical advice. The last update was in February 2020. The featured image comes from The Leaf Project @ Flickr and is shared via a CC BY-SA 2.0 license.