On student evaluation: some thoughts

Today I had the pleasure of attending a professional development seminar for English Language teachers on the topic of student evaluation. Given its links to accountability, evaluation is a topic of critical importance in Greek education, so what follows is an attempt to contribute to the field’s professional discourse by further developing some of the topics that were raised.

In the seminar, we were presented with a framework for student evaluation in Modern Foreign Languages (MFL), which is derived from ministerial guidelines and theoretical work carried out at the University of Athens Research Centre for English Language. Evaluation procedures, we were told, must be:

  • Constant and purposeful
  • Dynamic
  • Valid
  • Reliable
  • Objective
  • Holistic
  • Collaborative

Readers are invited to reserve judgement on the pedagogical value of such an evaluation policy until it is implemented; for the time being, however, I feel compelled to register some reservations.

Poor theoretical conceptualization

Many of the constructs listed in the policy do not seem to be particularly well grounded in the theory that informs our profession. For instance, we were told that holistic assessment refers to the student’s ability to use language in order to respond to set tasks, as measured along multiple communicative criteria:

By holistic we mean that the evaluation criteria we formulate concern the student’s ability to use the language being learnt in order to carry out the activities specified by the ΕΠΣ-ΞΓ (the Integrated Foreign Languages Curriculum), always in accordance with the student’s level of communicative competence. [This definition appears to originate from here, although it was not attributed during the seminar.]

This formulation seems to be at variance with standard definitions of the construct (e.g. Alderson et al. 1995: 107, below), and the reference to multiple criteria («κριτήρια») appears to be particularly problematic. Further compounding the issue, most of the examples we were shown appeared to be from analytical scales, where criteria such as “task accomplishment”, “grammatical correctness” and “lexical range” were used to calculate a “holistic” score.

Examiners may be asked to give a judgement on a candidate’s performance as a whole, in which case they will use a holistic scale… When examiners use this type of scale, they are asked not to pay too much attention to any particular aspect of the candidate’s production, but rather to make a judgement of its overall effectiveness. (Alderson et al. 1995: 107-8, original emphasis)

Lack of operational definitions

A second problem was that none of the dimensions in the list appeared to have been operationally defined. In other words, although they were defined in abstract theoretical terms, it was not immediately obvious how these theoretical definitions applied to practice.

In the interest of brevity, I will confine myself to the dimension of validity, which was (correctly) defined as the overlap between what a test measures and what it is presumed to measure. There are many aspects to validity, e.g. content relevance and coverage, concurrent criterion relatedness, or predictive utility (Bachman 1990), and it is hard to tell which of these the authors of the list had in mind, but this is perhaps not so important.

What is important is that, for validity to be a meaningful construct, it is imperative that we – educators and stakeholders alike – share a common vision of what needs to be learnt, what skills are to be mastered, what traits are to be developed. Instead, it appears that the starting point of the policy we were presented with was a tool-kit of heterogeneous testing instruments, such as standardized tests, collaborative projects, language portfolios, and learning diaries, each measuring different, and sometimes undefined, competencies.

To avoid creating the impression that we have put the cart (what we can measure) before the horse (what we need to know), it is imperative that teachers are provided with clear operational definitions of what is tested, and why. In this sense, one is reluctantly compelled to say that the competent authorities do not appear to have done the best possible job.

Inconsistency of criteria

Thirdly, it would appear that in an attempt to be comprehensive, this list contains evaluation features that are difficult to reconcile with each other.

Anyone who has some experience with evaluation knows that there is tension between the goals of reliability and validity. Evaluation procedures that maximize reliability tend to sacrifice validity and vice versa. This happens because the most reliable measures are numeric, but the constructs we want to measure are not numbers. Similarly, collaborative evaluation (defined as engaging pupils in the design of evaluation instruments as well as scoring) is hard to reconcile with objectivity, because pupils – bless their hearts – are not always objective.

At the risk of sounding harsh, a wishlist of desirable evaluation features is something anyone can compile, but no one needs. What is needed instead is a policy that ranks such features in order of importance, or combines them in meaningful, unambiguous ways. This is a task that, apparently, remains to be done.


Neglect of practicality

Given the comprehensiveness of the list, it is odd that no reference is made to practicality as a quality of evaluation. Practicality might be defined as the efficient use of available resources in order to attain evaluation objectives (cf. Bachman & Palmer 1996: 35-37). Practicality is important in two senses: most obviously, because a policy cannot be implemented if the resources necessary exceed those available; equally importantly, because teachers are responsible for the cost-effective distribution of a finite amount of resources to achieve both testing and learning objectives.

To illustrate: in the seminar, a five-step procedure was presented for using the European Language Portfolio, one of the several evaluation instruments that are to be deployed. According to a rough personal estimate (Table 1), the full implementation of the procedure requires 18 contact hours spread over an academic year. Bearing in mind that students receive approximately 86 hours of instruction in English in a given academic year, this single instrument would take up roughly a fifth of the available teaching time. Its pedagogical benefits notwithstanding, this particular evaluation procedure therefore seems rather impractical.

Activity                                                         Contact hours
Introductory session                                             1 annually
Metacognitive reflection and completion of Language Biography    1 bimonthly
Presenting progress to peers                                     25 students x 5 minutes / student = 3 bimonthly
Metacognitive reflection and completion of Language Passport     1 annually
Total                                                            18 annually

Table 1 – Integrating the European Language Portfolio in MFL (workload estimate)
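
The arithmetic behind Table 1 can be retraced with a short sketch. The student count, the five-minute presentations and the 86 annual hours come from the post. Two figures are assumptions of this sketch, not stated in the post: that a contact hour lasts 45 minutes, and that the bimonthly activities recur four times in an academic year. Under these assumptions the table's total of 18 hours is reproduced.

```python
import math

# Figures taken from the post
STUDENTS = 25
MINUTES_PER_PRESENTATION = 5
ANNUAL_ENGLISH_HOURS = 86

# Assumptions of this sketch (not stated in the post)
PERIOD_MINUTES = 45   # assumed length of one contact hour
CYCLES_PER_YEAR = 4   # assumed number of bimonthly cycles in an academic year


def presentation_periods(students, minutes_each, period_minutes):
    """Contact hours needed for every student to present once, rounded up."""
    return math.ceil(students * minutes_each / period_minutes)


def annual_portfolio_hours(cycles_per_year=CYCLES_PER_YEAR,
                           period_minutes=PERIOD_MINUTES):
    """Total annual contact hours for the activities listed in Table 1."""
    introductory_session = 1   # once a year
    language_passport = 1      # once a year
    biography_per_cycle = 1    # every cycle
    presenting_per_cycle = presentation_periods(
        STUDENTS, MINUTES_PER_PRESENTATION, period_minutes
    )
    recurring = cycles_per_year * (biography_per_cycle + presenting_per_cycle)
    return introductory_session + recurring + language_passport


total = annual_portfolio_hours()
share = total / ANNUAL_ENGLISH_HOURS
print(f"{total} contact hours, {share:.0%} of annual instruction time")
```

Changing the assumed period length to 60 minutes still rounds the presentations up to three contact hours per cycle, so the estimate is not very sensitive to that assumption; it is more sensitive to the number of cycles per year.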


Conclusion

In this post I have argued that the evaluation policy for Modern Foreign Languages, as put forward by the competent authorities in our education system, contains several interesting features, but that more than a few points warrant more attention than they were given. Possible improvements include the theoretical refinement of underlying constructs, the development of sound operational definitions for its dimensions, the unambiguous prioritization of those dimensions that are in tension, and the inclusion of considerations of practicality in planning. Unless such improvements are made – as a matter of urgency – there is a grave danger that students and teachers will be seen to fail in a system where success is simply not possible.
