Reporting data inaccurately: why, when and how to confidentialise data? (Achilleas Kostoulas)

Research, we surely agree, is about finding the truth and accurately reporting it. In this post, however, I will talk about those cases when we might have to actually distort the truth while reporting on a research project. This is normally done in the interest of protecting research participants, and it is called ‘confidentialising’. In the paragraphs that follow, I will discuss why we sometimes have to report data that is not quite accurate; when this is necessary; and how to do it.

Why is it necessary to confidentialise data?

Almost invariably, people who participate in a research project do so under the conditions of confidentiality and anonymity: it is understood that the researcher(s) will not personally identify any research participants, or attach any names to particular pieces of data. Indeed, participants normally receive written assurances to that effect, and there may be civil liabilities associated with breaching these agreements, as well as serious disciplinary consequences for academic malpractice. Further, the principle of non-maleficence stipulates that any insights to be gained by research should not be to the detriment of participants: simply put, in doing research we must do no harm.

The problem is that, when reporting small data sets in particular, a detailed representation of the data could sometimes allow participants to be individually identified, and this could have unintended consequences. Consider the following hypothetical example: Let us assume that, in the process of conducting research in a school, information is collected regarding the job satisfaction of the staff (Table 1). The researcher will likely want to report this information, but they may well be concerned that when it is published, the school authorities will find out the French language teacher is very unhappy with her job, and this may have consequences for her long-term employment prospects. What can be done?

Table 1 – Job satisfaction distribution by teacher speciality

When must we confidentialise data?

As a rule of thumb, there are two things that researchers need to consider in order to make an informed decision about confidentialising data: the frequency of participants in the cell (the quantity criterion), and the impact that identification is likely to have (the quality criterion).

Regarding the quantity criterion, the threshold value is usually set to 3, 5 or 10: If the threshold value is set at 3, the researcher may decide to confidentialise any cell which has a frequency count of 1 or 2 (cells with a frequency of 0 don’t normally need to be considered). Setting an appropriate threshold value depends on factors such as the size of the data set and any relevant local legislation or organizational rules.

In the example above, there are 12 cells which violate the quantity criterion at the threshold level of 3. Strictly speaking, all of these cells present a disclosure risk, but the ones that are particularly problematic are the ones pertaining to the French language teacher, the German language teacher and the PE instructors. In these cases in particular, any reader will be able to deduce with certainty the answers given by the respondents.
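The quantity check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the counts below are invented (they are not the figures from Table 1), and the function names `risky_cells` and `fully_disclosed_rows` are hypothetical helpers, not part of any standard tool.

```python
# Hypothetical counts, invented for illustration (NOT the data in Table 1).
# Rows are teacher specialities; columns are satisfaction levels.
LEVELS = ["very dissatisfied", "dissatisfied", "satisfied", "very satisfied"]
counts = {
    "Generalist": [2, 5, 6, 1],
    "English": [0, 1, 2, 0],
    "French": [1, 0, 0, 0],
    "PE": [0, 0, 2, 0],
}


def risky_cells(table, threshold=3):
    """Return (row, column index, count) for every cell with 0 < count < threshold."""
    return [
        (row, col, n)
        for row, values in table.items()
        for col, n in enumerate(values)
        if 0 < n < threshold
    ]


def fully_disclosed_rows(table):
    """Return rows where every respondent gave the same answer.

    In such rows a reader can deduce each individual's response with certainty.
    """
    return [
        row
        for row, values in table.items()
        if sum(values) > 0 and max(values) == sum(values)
    ]


flagged = risky_cells(counts, threshold=3)
for row, col, n in flagged:
    print(f"{row} / {LEVELS[col]}: {n}")
print("Fully disclosed:", fully_disclosed_rows(counts))
```

Cells with a count of zero are skipped, in line with the rule of thumb above. The second helper singles out the most problematic rows, where all responses fall in a single cell.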

However, it may not be necessary to confidentialise all these cells, and this is where the quality criterion comes into play. It could be argued that the information that the PE instructors and the German language teacher are satisfied with their jobs is not potentially detrimental. It could also be argued that this information is analytically significant, because it is indicative of a trend (specialists seem to be happier than generalists). On the other hand, the cell with the information about the French teacher does not just pose a great disclosure risk; it can also be a liability to the person involved. In this case, it would be difficult to argue that whatever insights the research generates outweigh the consequences for the person involved.

How to confidentialise data

There are three techniques which the hypothetical researcher might use in order to convey the information they have collected, without compromising the participants’ anonymity.

1. Category manipulation

It may be possible for the researcher to conceal sensitive information by merging the categories with the smallest number of participants. In the example above, one might want to merge English, French and German language teachers into a Modern Foreign Language teachers category (Table 2a), or even collapse several categories into a ‘specialist teacher’ category (Table 2b). Doing so would conceal the sensitive information without much loss of analytical detail.

Table 2a – Example of category manipulation (MFL)

Table 2b – Example of category manipulation (Specialists)

When collapsing categories, researchers should make sure that the overarching categories are psychologically or socially ‘valid’ (e.g. do teachers perceive themselves as being divided into ‘generalists’ and ‘specialists’?). This can be established through participant validation. They also need to take care that merging does not obscure analytically important differences: the categories being merged should have fairly similar distributions of responses. It may be possible to establish this statistically.
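As a rough sketch of category merging, the Python snippet below collapses several rows of a frequency table into one by adding up their counts column by column. The counts are invented for illustration (they are not the data behind Tables 2a and 2b), and `merge_rows` is a hypothetical helper name.

```python
# Hypothetical counts, invented for illustration (not the data behind Table 2a).
# Columns: very dissatisfied, dissatisfied, satisfied, very satisfied.
counts = {
    "Generalist": [2, 5, 6, 1],
    "English": [0, 1, 2, 0],
    "French": [1, 0, 0, 0],
    "German": [0, 0, 1, 0],
}


def merge_rows(table, groups):
    """Collapse rows of a frequency table.

    `groups` maps each new category name to the list of original rows it
    absorbs; rows not mentioned in any group are copied unchanged.
    """
    absorbed = {row for members in groups.values() for row in members}
    merged = {
        row: list(values) for row, values in table.items() if row not in absorbed
    }
    for name, members in groups.items():
        # Add the absorbed rows together, one column at a time.
        merged[name] = [sum(column) for column in zip(*(table[m] for m in members))]
    return merged


merged = merge_rows(counts, {"MFL": ["English", "French", "German"]})
for row, values in merged.items():
    print(row, values)
```

After merging, the single ‘very dissatisfied’ response can no longer be attributed to one language teacher, although in this made-up table the merged row still contains small counts, so the quantity criterion would need to be checked again.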

2. Data suppression

Data suppression involves removing sensitive information from the data set. In our example (Table 3), data suppression would most obviously involve removing the answer that was given by the French language teacher. In addition, it would involve removing all the other possible responses for that category, so as to prevent people from deducing her answer mathematically. In our hypothetical example, it also seemed necessary to suppress the responses of the German language teacher and the PE instructors: if only a negative response were suppressed while the positive responses were left visible, an intelligent reader could deduce that the omitted response must have been negative. Finally, note that information about category totals has been presented in ways that do not allow inferences about the possible responses.

Table 3 – Example of data suppression

Obviously, there is considerable tension between data suppression and the imperative to publish research results: there would be very little point in collecting data which one would then withhold from the academic community or other stakeholders. Data suppression is a straightforward measure, but is best used sparingly.
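The suppression procedure described above might be sketched as follows. The counts are again invented (not the data behind Table 3), and `suppress_rows` and `visible_column_totals` are hypothetical helper names. Computing column totals over the visible rows only is one possible way of preventing recovery by subtraction; it is an assumption of this sketch, not necessarily how the totals in Table 3 were handled.

```python
# Hypothetical counts, invented for illustration (not the data behind Table 3).
# Columns: very dissatisfied, dissatisfied, satisfied, very satisfied.
counts = {
    "Generalist": [2, 5, 6, 1],
    "English": [0, 1, 2, 0],
    "French": [1, 0, 0, 0],
    "PE": [0, 0, 2, 0],
}


def suppress_rows(table, hidden, marker=None):
    """Replace every cell of the rows named in `hidden` with `marker`.

    Whole rows are blanked, so that a reader cannot work out the sensitive
    cell from the other responses in the same category.
    """
    return {
        row: [marker] * len(values) if row in hidden else list(values)
        for row, values in table.items()
    }


def visible_column_totals(table):
    """Sum each column over unsuppressed cells only (None cells are skipped)."""
    return [
        sum(n for n in column if n is not None)
        for column in zip(*table.values())
    ]


published = suppress_rows(counts, hidden={"French", "PE"})
totals = visible_column_totals(published)
for row, values in published.items():
    print(row, values)
print("Totals (visible rows only):", totals)
```

If totals are reported this way, the report should say that they exclude the suppressed categories.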

3. Data rounding

A third technique for confidentialising data involves rounding all data to a multiple of a suitable base, e.g. to the nearest multiple of 3, 5 or 10. Data rounding involves distorting the data, so the rounding base is selected with a view to minimizing this distortion. In the interest of simplicity, the same rounding base should be used throughout, but sometimes a graduated rounding strategy may be used: for example, numbers under 100 may be rounded to base 3, and numbers over 100 to base 10.
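A minimal sketch of rounding to a base, including the graduated variant, might look like the Python below. The function names are hypothetical, ties are rounded upwards (an assumption; Python's built-in `round` would instead round ties to the nearest even number), and a count of exactly 100 is treated as belonging to the larger base, which the rule of thumb above leaves open.

```python
def round_to_base(n, base=3):
    """Round a non-negative integer count to the nearest multiple of `base`.

    Integer arithmetic is used so that ties are always rounded upwards.
    """
    return base * ((2 * n + base) // (2 * base))


def graduated_round(n, cutoff=100, small_base=3, large_base=10):
    """Use `small_base` below `cutoff` and `large_base` from `cutoff` upwards.

    Treating a count equal to `cutoff` as 'large' is an assumption of this sketch.
    """
    return round_to_base(n, small_base if n < cutoff else large_base)


for count in [0, 1, 2, 4, 5, 98, 104]:
    print(count, "->", graduated_round(count))
```

Note that a count of 1 is rounded down to 0 and a count of 2 is rounded up to 3, which illustrates why rounding distorts small tables so visibly.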

In Table 4, the data have been rounded to base 3, which seemed most appropriate given the small numbers involved (virtually all cells were < 3). Even so, the rounding strategy involved the loss of some information, particularly in the “very satisfied” column.

Table 4 – Example of data rounding to base 3

Data rounding works best with larger data sets, where the distortion created by rounding off is not quite so evident. For smaller data sets, such as this example, other methods may be more appropriate.

Concluding remarks

Confidentialising data is on occasion necessary, but it often comes with a loss of analytical detail or a distortion of the data. Researchers will have to decide whether the loss of data is justified in relation to the disclosure risks. The choice of method will often depend on the particularities of the project, such as the nature of the research questions, the distribution of data and any requirements set by the funding agencies or other stakeholders. In all cases, however, researchers need to report transparently what data manipulation procedures were used, as well as why they were deemed essential.

The information in this page was based on a Confidentiality Information Sheet prepared by the Australian National Statistical Service.