Some Thoughts on Student Evaluation in Greek ELT (Achilleas Kostoulas)

Today I had the pleasure of attending a professional development seminar for English Language teachers on the topic of student evaluation. Given its links to accountability, evaluation is a topic of critical importance in Greek education, so what follows is an attempt to contribute to the field’s professional discourse by further developing some of the topics that were raised.

In the seminar, we were presented with a framework for student evaluation in Modern Foreign Languages (MFL), which is derived from ministerial guidelines and theoretical work carried out at the University of Athens Research Centre for English Language. Evaluation procedures, we were told, must be:

Constant and purposeful
Dynamic
Valid
Reliable
Objective
Holistic
Collaborative

Readers are invited to reserve judgement on the pedagogical value of such an evaluation policy until it is implemented. For the time being, however, I feel compelled to register some reservations.

Poor theoretical conceptualisation

Many of the constructs listed in the policy do not seem to be particularly well grounded in the theory that informs our profession. For instance, we were told that holistic assessment refers to the students’ ability to use language in order to respond to set tasks, as measured along multiple communicative criteria:

By holistic we mean that the evaluation criteria we formulate concern the learner’s ability to use the language they are learning in order to carry out the activities specified by the ΕΠΣ-ΞΓ (the Integrated Foreign Languages Curriculum), always in accordance with their level of communicative competence. [although not attributed during the seminar, this definition appears to derive from a set of instructions by the University of Athens Research Centre for English Language]

Despite its august origins, this definition seems to be at variance with more established definitions of the construct in the literature. The reference to multiple criteria («κριτήρια»), in particular, seems surprising given the commonly understood meaning of the term ‘holistic’. Further compounding the issue, most of the examples we were shown appeared to be analytic scales, where criteria such as “task accomplishment”, “grammatical correctness” and “lexical range” were used to calculate a “holistic” score. As Alderson et al. (1995: 107-8) point out:

Examiners may be asked to give a judgement on a candidate’s performance as a whole, in which case they will use a holistic scale (…) When examiners use this type of scale, they are asked not to pay too much attention to any particular aspect of the candidate’s production, but rather to make a judgement of its overall effectiveness. (Alderson et al. 1995: 107-8, original emphasis)

It is difficult to know what to say when a state apparatus is given a flawed definition by experts who should know better, and then goes on to impose it, top-down, to assess students across an entire country.

Lack of operational definitions

A second problem was that none of the dimensions in the list appeared to have been operationally defined. In other words, although they were defined in abstract theoretical terms, it was not immediately obvious how these theoretical definitions applied to practice.

In the interest of brevity, I will confine myself to the dimension of validity, which was (correctly) defined as overlap between what a test measures and what a test is presumed to measure. There are many aspects to validity, such as content relevance and coverage, concurrent criterion relatedness or predictive utility (Bachman 1990), and it is hard to tell what the authors of the list had in mind, but this is perhaps not so important.

What is important is that, for validity to be a meaningful construct, it is imperative that we, educators and stakeholders alike, share a common understanding of what needs to be learnt, what skills are to be mastered, and what competences are to be developed. Instead, it appears that the starting point of the policy with which we were presented was a composite toolkit consisting of many heterogeneous testing instruments, such as standardised tests, collaborative projects, language portfolios, and learning diaries, each measuring different, and sometimes undefined, competences.

To avoid creating the impression that we have placed the cart (what we can measure) before the horse (what we need to know), teachers need to be provided with clear operational definitions of what is tested, and why. In this sense, one is reluctantly compelled to say that the competent authorities do not appear to have done the best possible job.

Inconsistency of criteria

Thirdly, it would appear that in an attempt to be comprehensive, this list contains evaluation features that are difficult to reconcile with each other.

Anyone who has some experience with evaluation knows that there is tension between the goals of reliability and validity. Evaluation procedures that maximise reliability tend to sacrifice validity and vice versa. This happens because the most reliable measures are numerical, but the constructs we want to measure (e.g., the ability to understand an announcement, infer irony in a written text, make your point tactfully in a foreign language) are not numbers. Similarly, collaborative evaluation (defined as engaging pupils in the design of evaluation instruments as well as scoring) is hard to reconcile with objectivity, because pupils (bless them) are not always consistent or objective.

At the risk of sounding harsh, a wishlist of desirable evaluation features is something anyone can compile, but no one needs. What is needed instead is a policy that ranks such features in order of importance, or combines them in meaningful, unambiguous ways. This is a task that, apparently, remains to be done.

Limited practicality

Given the comprehensiveness of the list, it is odd that no reference is made to practicality as a quality of evaluation. Practicality might be defined as the efficient use of available resources in order to attain evaluation objectives (cf. Bachman & Palmer 1996: 35-37). Practicality is important in two senses: most obviously, because a policy cannot be implemented if the resources necessary exceed those available; equally importantly, practicality matters because teachers are responsible for the cost-effective distribution of a finite amount of resources to achieve both testing and learning objectives. This, of course, means that any resources spent testing are not available for teaching.

To illustrate: in the seminar, a five-step procedure was presented for using the European Language Portfolio, one of several evaluation instruments that are to be deployed. According to a rough personal estimate (Table 1), the full implementation of the procedure requires 18 contact hours spread over an academic year. In my calculations, I used the instructions with which we were provided, and drew on personal teaching experience to supplement any missing information. I also assumed that this work would be done in class rather than as homework, because it is an assessment procedure. It should be borne in mind that in a given academic year, students receive no more than approximately 86 hours of instruction in English [this was accurate at time of writing; the number is even lower now]. So it seems fair to say that, its pedagogical benefits notwithstanding, this particular evaluation procedure (which is one of several such procedures) is excessively demanding.

Activity	Contact hours
Introductory session	1h annually
Metacognitive reflection & completion of Language Biography	1h bimonthly
Presenting progress to peers	25 students x 5 minutes per student ≈ 3h bimonthly (rounded up)
Metacognitive reflection & completion of Language Passport	1h annually
Total	18h annually

Table 1 – Integrating the European Language Portfolio in MFL lessons in Greek schools (workload estimate)
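
The total in Table 1 can be reproduced with a short calculation. The sketch below is only an illustration. It assumes that "bimonthly" means four two-month cycles in a school year, and that presentation time is rounded up to whole 45-minute teaching periods. Neither assumption is spelled out in the text above, and all constant and function names are made up for this example.

```python
from math import ceil

# Figures come from Table 1 and the surrounding text unless marked as assumptions.
BIMONTHLY_CYCLES = 4       # assumption: four two-month cycles per school year
PERIOD_MINUTES = 45        # assumption: one contact hour = one 45-minute period
ANNUAL_ENGLISH_HOURS = 86  # upper bound on yearly English instruction quoted in the post


def presentation_hours(students: int = 25, minutes_each: int = 5) -> int:
    """Whole teaching periods needed for every student to present once."""
    return ceil(students * minutes_each / PERIOD_MINUTES)


def annual_portfolio_hours() -> int:
    """Sum the four rows of Table 1 over one academic year."""
    introductory = 1                                          # once a year
    biography = 1 * BIMONTHLY_CYCLES                          # 1h per cycle
    presentations = presentation_hours() * BIMONTHLY_CYCLES   # 3h per cycle
    passport = 1                                              # once a year
    return introductory + biography + presentations + passport


total = annual_portfolio_hours()
share = total / ANNUAL_ENGLISH_HOURS
print(f"{total}h of {ANNUAL_ENGLISH_HOURS}h = {share:.0%}")
```

Under these assumptions, the portfolio procedure alone would take up roughly a fifth of the 86 hours available for English in a year.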

Summary

In this post I have argued that the evaluation policy for Modern Foreign Languages, as put forward by the competent authorities in our education system, contains several interesting features, but that more than a few points warrant more attention than they were given. Some possible improvements include theoretical refinement of underlying constructs, the development of sound operational definitions for its dimensions, the unambiguous prioritisation of those dimensions that are in tension, and the inclusion of considerations of practicality in planning. Unless such improvements are made as a matter of urgency, there is a grave danger that students and teachers will be seen to fail in a system where success is simply not possible.

About this post: This post was written in response to the introduction of a short-lived evaluation policy in the Greek state school system, and it is a concise version of feedback that I was asked to give. The proposed system proved stillborn and, despite the enthusiasm of the apparatus that tried to implement it, it was abandoned before it could be fully rolled out, though not before a substantial amount of money had been invested in its design and teacher training. I suppose I could (or should) have let things take their course, but at the time it seemed important to voice my concerns.