Much has been written in the Greek press about the latest university entrance exams, and I would like to add my own perspective in partial defence of the exam board. By way of background to the recent controversy, one of the Physics problems that candidates were required to answer seems to have been ambiguously worded. Because of this poor wording, the problem could not be solved by reference to the set coursebook, and it was, in the view of the Union of Science Teachers, ‘scientifically flawed’.
A furor ensued, which forced the Central Examination Committee to initially instruct markers to disregard that particular problem. Subsequently, when samples of marked papers started coming in, it became apparent that some of the stronger candidates (12%, according to one source) had indeed provided sound answers by applying creativity and intuition to the problem. In light of the new evidence, the Committee revoked their initial decision, i.e. they finally decided that the problem would count. This debacle has brought to the fore a series of questions, ranging from the exam format, to the unfairness of one-off high-stakes exams, to the standards of teaching.
I don’t think that I can add much to that debate, but I would like to put forward that much of this controversy reveals serious misunderstandings regarding the nature of exams, and I hope I might help to clarify some of these issues by disambiguating some fundamental concepts.
Norm-referenced & Criterion-referenced tests
Very roughly speaking, examinations can be classified into two types: norm-referenced exams and criterion-referenced exams. In this section I’d like to lay out a broad-strokes description of their differing features, before writing about the tasks that are appropriate to each type, and how they relate to our university entrance exams.
In a norm-referenced test, candidates are ranked according to their scores, and then a certain number are deemed to have passed. The number of successful candidates is arbitrary and usually determined by some kind of inflexible consideration, such as the number of places available in a university programme. An analogy can be drawn to a Formula 1 race: drivers are ranked according to how fast they drive, and then the first three get to spray each other with champagne. These tests are useful in selecting, from a pool of candidates, those who are best qualified for some particular purpose.
Criterion-referenced tests rely on some kind of standard that is defined a priori. Exam tasks are designed to test whether the criterion has been mastered, and a cut-off point is set, below which performance is deemed to be unacceptable. Any candidate who scores at or above the cut-off point is considered to have succeeded. To extend the driving analogy, criterion-referenced tests are like driving exams: driving competence is reduced to a number of measurable skills (e.g. you should be able to park in so-many moves), and anyone who can successfully perform these tasks is licensed to drive. These tests are useful in order to ensure that certain tasks are only carried out by qualified individuals.
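The difference between the two decision rules can be sketched in a few lines of Python. This is only an illustration: the candidate names, scores, number of places and cut-off are all invented, and the function names are my own.

```python
# Hypothetical scores (out of 100); names and numbers are invented for illustration.
scores = {"Anna": 82, "Babis": 47, "Christos": 64, "Dora": 91, "Eleni": 55}

def norm_referenced(scores, places):
    """Rank candidates by score and admit the top `places`, whatever their scores."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:places]

def criterion_referenced(scores, cut_off):
    """Pass every candidate who reaches the cut-off, however many there are."""
    return [name for name, score in scores.items() if score >= cut_off]

admitted = norm_referenced(scores, places=2)
passed = criterion_referenced(scores, cut_off=50)
print(admitted)  # ['Dora', 'Anna']
print(passed)    # ['Anna', 'Christos', 'Dora', 'Eleni']
```

Note that the first rule fixes the number of successful candidates and lets the passing score float, whereas the second fixes the passing score and lets the number of successful candidates float.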
Designing tasks to fit a purpose
These different purposes require very different examination toolkits. In the case of criterion-referenced tests, it’s necessary to pre-measure the facility values of test items, and ensure that they are consistent (i.e. the items need to be standardized). This ensures that candidates in one sitting are not unduly disadvantaged compared to their peers who may have had easier questions. Criterion-referenced exam items also need to demonstrably measure constructs that are relevant to the stakeholders’ needs (i.e. they need to satisfy the validity criterion). For example, it could be argued that a candidate competing for a place on a Classical Literature programme will need a score of at least 75% in Ancient Greek, or else they will not be able to engage with the texts, and this pronouncement should ideally be grounded in statistical data showing correlations between university entrance exam scores and eventual academic attainment. Because the only necessary output in criterion-referenced testing is whether the candidate has mastered the criterion or not, criterion-referenced tests can be quite crude, as long as the criterion is well-defined and the items map on to it with consistency.
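For readers unfamiliar with the term, the facility value of an item is conventionally the proportion of candidates who answer it correctly. A minimal sketch of the calculation follows, using an invented table of pilot responses; the data and the function name are assumptions made for the sake of the example.

```python
# Hypothetical response matrix: one row per candidate, one column per item
# (1 = correct, 0 = incorrect). The data are invented for illustration.
responses = [
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
]

def facility_values(responses):
    """Return, for each item, the proportion of candidates who answered it correctly."""
    n_candidates = len(responses)
    return [sum(column) / n_candidates for column in zip(*responses)]

fv = facility_values(responses)
print(fv)  # [0.8, 0.8, 0.4, 0.2]
```

In this toy data set the first two items are easy (80% of candidates answer them correctly) and the last one is hard (20%). Standardizing a criterion-referenced test amounts to keeping such values comparable from one sitting to the next.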
Norm-referenced tests are quite different: what is important in this case is that the test should produce an unambiguous ranking that clearly differentiates among candidates. To achieve this, candidate scores should be spread out as much as possible, preferably on a normal distribution, and ideally one where the median coincides with the middle of the scoring range. Simply put, an ideal score distribution will have very few candidates achieving top marks, a fair number in the mid-to-high range, many candidates in the mid-range, and then the distribution would tail off to a fair number in the mid-to-low range, and very few in the left (low) tail. Because candidates are spread out over a wide range of scores, it is easier to distinguish differences in their performance, which (in theory at least) ensures a fairer ranking. Note that, while items need to be stratified in terms of facility, standardization with previous years is unnecessary, because candidates are only competing against their cohort. Also note that a cut-off point is irrelevant to norm-referenced testing.
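To see why dispersal matters for ranking, consider the rough sketch below. It compares two invented cohorts, one spread around the middle of the scale and one crowded near the top, and counts how many pairs of candidates are separated by less than an assumed margin of scoring error. The scores, the 2-point margin and the function name are all hypothetical.

```python
from itertools import combinations
from statistics import median

# Two hypothetical cohorts of eight candidates, scored out of 100.
# All numbers are invented for illustration.
spread_scores = [12, 25, 38, 50, 50, 63, 75, 88]
crowded_scores = [88, 89, 90, 90, 91, 91, 92, 93]

def indistinguishable_pairs(scores, margin):
    """Count pairs of candidates whose scores differ by less than `margin`."""
    return sum(1 for a, b in combinations(scores, 2) if abs(a - b) < margin)

# Assumed margin of scoring error: 2 points (a made-up figure).
margin = 2
spread_ties = indistinguishable_pairs(spread_scores, margin)
crowded_ties = indistinguishable_pairs(crowded_scores, margin)

print(median(spread_scores), spread_ties)    # 50.0 1
print(median(crowded_scores), crowded_ties)  # 90.5 12
```

In the crowded cohort, 12 of the 28 possible pairs fall within the assumed margin, so the relative ranking of those candidates is little better than noise. In the spread cohort, only one pair does.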
The University entrance exams
In the final section of this post, I’d like to argue that our university entrance exams confound the two systems, and that many of the problems associated with their administration stem from this confusion.
From the previous discussion, it should be clear that the university entrance exams should, by definition, be norm-referenced, because the purpose of these exams is to distribute a large number of candidates into a limited number of university places. In the past, the ratio of places to candidates used to be 1:4, which meant that only those candidates in the right (high) tail of the distribution entered university. Recent political decisions to increase the intake of students have meant that the ratio is nowadays nearly 1:1.5, which means that candidates in the left (low) tail of the distribution are also admitted into some form of Higher or Further Education. Despite the outcry about low standards, this is not a flaw in the selection process, but rather an outcome of education policy: simply put, if we want everyone to have some form of tertiary education, then that must include those high school graduates who are academically weak.
To counter mounting dissatisfaction with the perceived drop in standards, a short-lived and misguided attempt was made in 2006 to introduce a cut-off point into the exam system. The cut-off point was arbitrarily set at 50% of the maximum score, by projecting the standards of secondary education onto the university exams. This decision confounded criterion- with norm-referencing. Examinations in secondary schools are about ensuring that no students are promoted to a grade level if they don’t have the skills to succeed in it, i.e. they are criterion-referenced. It’s not clear why, but the mastery level has always been defined as 50% of the maximum possible, and it’s expected that all students should, in principle, be able to master this level. But, as I previously suggested, in an efficient norm-referenced exam, 50% is where the median score should be, i.e. it is technically desirable that half the candidates have scores lower than that in order to attain maximum score dispersal.
The outcome of this decision was that test items with high facility values had to be designed, and scores were crowded into the top quarter of the scoring range. Often the decision of whether one would study to be a doctor or a vet hinged on scoring differences of a fraction of a point, i.e. well below the margin of scoring error. Despite the grave flaws in that system, public opinion seemed content because the resultant grade inflation created the illusion that standards were improving, and that taxpayers were spared the expense of educating “boulder-brained” students.
The cut-off provision was eventually revoked in 2009 (for reasons which, I suspect, were only partly academic), but the public perception remains that, unless the majority of candidates do well in the exams, there must be something wrong with either the test items or with their academic preparation. The protest by the Union of Science Teachers is typical of this attitude:
τα [θέματα] απαιτούσαν μεγάλη εμπειρία και ιδιαίτερη διαίσθηση στη φυσική, σαν να επρόκειτο για διαγωνισμό ταλέντων φυσικής.
The test items required great experience and special intuition in physics, as if it was a physics talent contest.
I hope it’s clear from the exposition above that norm-referenced exams are, in fact, a contest of sorts, and that challenging questions which result in a wide dispersal of scores and encourage creative thinking are, in my opinion, necessary components of an efficient university entrance exam. As a result, I find it hard to blame the exam committee for their selection of tasks, and I am puzzled by such objections. That said, it is also my belief that in the face of such sustained criticism, it is the responsibility of academic entities to stand by their informed decisions and educate the public as to their rationale, rather than be swayed by angry rhetoric, vociferous complaints and hidden agendas. It is in this sense that I believe the Examination Committee has failed in its mandate.