In our system, so much hangs on the value given to the assessments that lead to qualifications. As we seek to measure learning with some degree of accuracy, we risk losing contact with the nature of the learning itself. Our increasing need for measures that are reproducible, consistent and transparent decreases our capacity to accept the inherent uncertainty in the whole enterprise. In our attempt to create a fair system where everyone knows what the standards are, we impose a framework that is outwardly rigid and linear relative to the fuzzy, dynamic nature of the processes we are measuring. In our classrooms we create clouds of learning – ephemeral, shape-shifting, finite but with indefinite boundaries – and then try to count them up like building blocks. It leads to all kinds of problems.
Here are some things to think about:
The value of marks.
It’s so normal to us to set tests where marks are awarded for certain questions. We go through a process of allocating marks to questions more or less on the basis that the marks are broadly equal in size. Two students gaining 14/20 on a test could be assumed to have performed equally well – or to have the same level of knowledge. But look at this GCSE Chemistry paper. There are five different ways to gain a mark. The unit size for each mark is different – there’s no meaningful way for them to be the same; concepts can’t be measured in terms of difficulty in that way. In the question below, six marks are available in a more open-ended response:
Why 6 marks? Why would a full response to this question be more valuable than five correct answers as given above? It’s a matter of judgement from an expert paper-setter, but there is no absolute sense in which this is true. Of course tests like GCSEs are designed to cover a range of ability levels; we need a range of easy and difficult marks so that the papers give everyone a chance to show their knowledge. Question 6 might well deserve more marks, but that would overly penalise those students who found it difficult. So, marks are deliberately weighted to create gradations that allow the assessment to function as a whole. We assume that, by averaging up the scores, we smooth out the variation in difficulty and, as it’s the same for everyone, we don’t worry too much about the individual mark values.
But take it to a boundary scenario – eg student A with 41/60 and student B with 42/60, where 42 is the grade boundary. The assumption that student B has ‘more knowledge’ or ‘knows more facts’ is highly dubious. It really depends on which questions they got right. With a different combination of equally plausible mark weightings, students A and B could reverse their scores. And yet that cliff-edge will have real-life consequences for those students. There’s an element of arbitrariness in the mix that doesn’t get enough recognition in the whole process. What is a mark worth and why? We should ask ourselves this more often.
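To make that concrete, here is a toy sketch – the questions, answer patterns and both weighting schemes are entirely invented for illustration – showing how two equally plausible mark weightings can reverse two students’ totals:

```python
# Invented example: students A and B each answer four of six questions
# correctly, overlapping on two. How many marks each earns depends
# entirely on how the paper-setter happened to weight the questions.

answers = {
    "A": [1, 1, 1, 1, 0, 0],  # 1 = question answered correctly
    "B": [0, 0, 1, 1, 1, 1],
}

weighting_1 = [1, 1, 2, 2, 3, 3]  # later questions judged harder
weighting_2 = [3, 3, 2, 2, 1, 1]  # earlier questions judged harder

def total(student, weights):
    # sum the weights of the questions this student got right
    return sum(w * got for got, w in zip(answers[student], weights))

# Under weighting_1, B beats A; under weighting_2, A beats B.
print(total("A", weighting_1), total("B", weighting_1))  # 6 10
print(total("A", weighting_2), total("B", weighting_2))  # 10 6
```

Neither weighting is ‘wrong’; the ranking of the two students is an artefact of a judgement call made before anyone sat the paper.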
There is no correct score for an essay.
Having spoken to senior people from OfQual last year and having been through the painful post-results AQA appeal process documented here, I am now fully aware of the idea that there is no correct mark for an essay. An English essay marked out of 50 may score 33 or 34 or 35 or 36, depending on the judgement of the examiner. That’s not unusual. Why? Because it’s difficult for two people to reach precise agreement mapping several hundred words of writing against the criteria. There is a judgement factor that is inherent in the very concept that an essay demonstrates understanding at a certain standard. It’s not a free-for-all but there is a legitimate level of uncertainty or tolerance. So, if a paper is re-marked, there is a good chance that the second marker will score it more highly than the first; this doesn’t mean either marker is wrong. But, imagine if schools submitted all papers for second marking – and demanded the highest of any marks given; it would push up everyone’s marks systematically. This is why there is a good case not to give students their papers back after exams – unless the marking discrepancies are much more significant.
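The ‘take the higher of two marks’ effect can be shown with a small simulation – the true mark and the size of the legitimate marker tolerance are invented for illustration – in which every marker is honest and none is wrong, yet the best-of-two policy inflates scores systematically:

```python
import random

random.seed(42)

TRUE_MARK = 34   # the notional 'true' standard of the essay (hypothetical)
TOLERANCE = 2    # legitimate marker disagreement, +/- 2 marks (hypothetical)
N_ESSAYS = 10_000

def one_mark():
    # a single marker's judgement: the true mark plus honest variation
    return TRUE_MARK + random.randint(-TOLERANCE, TOLERANCE)

single = [one_mark() for _ in range(N_ESSAYS)]
best_of_two = [max(one_mark(), one_mark()) for _ in range(N_ESSAYS)]

avg_single = sum(single) / N_ESSAYS
avg_best = sum(best_of_two) / N_ESSAYS

# avg_best comes out systematically higher than avg_single,
# even though no individual marker was 'wrong'
print(round(avg_single, 1), round(avg_best, 1))
```

With a ±2 tolerance, taking the higher of two independent marks adds nearly a whole mark on average – applied across a cohort, that is grade inflation by procedure, not by marker error.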
Alternatively, if we want to feel that essay-based subjects are assessed with more precision, we’d need to re-think the idea of essays altogether. Why is an essay the best way to assess English Lit or History or Politics? Why not multiple choice tests or tests more like those in science? Isn’t it because essay-writing is an integral part of the tradition of those disciplines – the open-ended nature of constructing an argument backed up with evidence? So, if we stick with essays, we need to accept the uncertainty principle and its consequences. On the other hand, if we want more precision, then perhaps essays aren’t the way to assess the learning.
Anecdote: The value of an essay grade.
My colleague Tim (@musotim) shared this story last week: A (now retired) colleague was challenged by a parent at parents’ evening: Why is this essay only worth a B? The teacher looked at it and then proceeded to explain how he’d applied the assessment criteria, leading to a B being the best fit. The parent wasn’t happy. ‘I still think it’s worth an A’. The teacher, very wise but slightly weary, said ‘OK’. He took the essay back, got out his pen and wrote ‘A’ on the top before handing it to the parent. ‘There you are’. Had this changed the quality of the essay? No. Did it change the true value of the essay? No. It was a Pyrrhic victory for the parent – who knew this all too well. Tim and I laughed a lot about that – and all the underlying implications. What is an A? How can we be sure? And who is it for? We need to think these things through.
Grade boundaries. Fuzzy clouds with cliff edges of uncertainty.
We all know that grade boundaries are created by a statistical overview of standards (all standards are derived from the bell curve, like it or not). But we’re not as up-front as we should be about the blurred edges. It feels right to use impressionistic grades to describe the standard of students’ work. That’s about an A; that’s probably an A*; that’s more like a B. Here we’re conceiving of a grade as a general range – more like a cloud. In fact, we’re happy to conceive of these standards overlapping – students rarely perform absolutely consistently at a certain standard and any piece of work has strengths and weaknesses. But, despite our intuition for the fuzzy-edged reality of grades as standards, we’re still likely to assume that a student with an official A has definitely done better than a student with a B. We load up the final grades with meaning. I’ve argued that all exams should be reported with scores. At least 49/80 and 50/80 can be seen for what they are – very close; actually pretty much identical. That change is unlikely to happen because the consensus seems to be that the satisfyingly fuzzy breadth of A, B and C (soon to be the 1-9 scale) overrides the awkward precision of a grade boundary. I’d argue that the fuzziness around an apparently more precise score, combined with freedom from any grade boundaries at all, would be a better application of the uncertainty principle. We can’t have both.
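The cliff-edge can be sketched in a few lines – the boundary values here are invented, not real GCSE boundaries – to show how grading collapses two essentially identical performances into different letters:

```python
# Hypothetical grade boundaries for a paper out of 80 (invented for illustration)
BOUNDARIES = [(50, "A"), (40, "B"), (30, "C"), (0, "U")]

def to_grade(score):
    # return the grade for the first boundary the score meets or exceeds
    for cutoff, grade in BOUNDARIES:
        if score >= cutoff:
            return grade

# 49/80 and 50/80 are, for practical purposes, the same performance,
# yet the reported grades suggest a categorical difference.
print(to_grade(49))  # B
print(to_grade(50))  # A
```

Reporting the raw scores preserves the information that the grades throw away: the one-mark gap is visible as exactly that.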
The illusion of linearity: KS3 Assessment and residual devotion to NC Levels.
With National Curriculum levels on the way out, schools are busy trying to work out alternatives. So far, the signs are that there’s huge inertia and most schools will be sticking to what they know for a good while yet.
There is a seductively reassuring illusion of steady progress afforded by the Level Ladder concept. We go from L5 to L6 and then L7. But we can do better – recognising that each level is broad, because we have sub-levels. So, across the nation, parents are presented with reports littered with the linear steps of 5b to 5a to 6c to 6b. The truth is that sub-levels are simply a device to create a sense of progress – they are not measures of absolute standards. We have no meaningful way of establishing that, in History for example, 5b to 5a is the same size jump as 6c to 6b. ‘Size’ in this context is a bizarre notion. And yet the steps on the left are presented as the path of progress – as if each step has the same size, not only within subjects but between them too. We’re invited to assume that 6b in Maths is much the same level as 6b in Geography (as if those things could be compared in absolute terms). In reality, the middle and right-hand steps might be closer – who knows? Of course the bell curve brings things into line one way or another but, all too often, we forget that levels are not real – they are a device and nothing more. It should be a relief to schools that NC levels are going – because now we can invest energy in developing our own devices without the pretence that we’re measuring in terms of absolute standards determined at a national level – when we’re not and never were.
If we think of levels as rough standards exemplified through moderated pieces of work and sets of level-defining questions, then a loose set of rising clouds is a better metaphor than a set of steps. An a,b,c code could be added but only as another loose indicator of movement. It’s a code for a general, subjective notion of progress; not a measure of steadily rising standards. The assessment uncertainty principle can allow for the former, but not the latter.
Levels of Progress
Once we apply the concept of uncertainty to levels, ‘levels of progress’ becomes very shaky as a meaningful measure. Sub-levels of progress is almost ‘finger in the air’ stuff. If we want to see how much progress students have made, we should look at their work and the questions they can answer, and compare that over time. The data code for that process will only ever be a crude proxy – never an actual measure. On that basis, the way we will need to show that our students are making progress is by comparing our ‘real-work progress files’ with those from our statistical neighbouring schools. I’ll show you mine if you show me yours. What do your highest attaining Y9s do in Geography? How does it compare to what we do? Let’s take a look. This is what working within the uncertainty principle allows; like-for-like comparison of real work is the most meaningful way to compare standards. Once we’ve opted for the data-proxy, our uncertainty dial goes off into the red – Maximum Validity Scepticism. Klaxon!!
Communicating Standards to Students: Focus on the work itself.
I’ve seen teachers go to great lengths to communicate standards to students via level ladders or grade descriptors. They invest time and energy showing the students what a Level 5 might look like, involving them in assessing their work to see whether it is Level 5. Then they explain the features of a generic Level 6 and hope that students take that on board in order to produce improved work. It’s all logical enough – but it’s so convoluted. Why not just go direct? If you can show a student how to improve their work, why does it help or matter if they can label the work with a level before and after? In my experience, the energy spent on explaining the levels is often dissipated beyond the point of being useful; teachers do it because they think they should or because it’s something to put in their mark books. In truth, only the most sophisticated learners can use the generic criteria meaningfully to improve their work. I’d suggest we could cut out a lot of this quasi-proxy-meta assessment coding and stick to the organic, intrinsic properties of the work. The grading can come later.
Drama, PE and Science Practicals and Speaking and Listening in English.
My final thoughts are about the way we’ve been trying to address uncertainty in different examinations – leading to them being compromised. Where we’ve known that the marking is heavily open to interpretation, we’ve tended to set out tight criteria that help students to get the best possible grades. However, over time this has led to an artificial squashing of the overall outcomes, skewed to the high end. In science practical assessments at GCSE, the practical skills and written analysis have become so formulaic that they’ve lost meaning as a way to set standards. The English Speaking and Listening assessments were similar. If you give everyone a recipe for a cake, then you’re going to get a lot of decent cakes; that doesn’t mean that everyone can bake. In Drama and PE, the issue is that the practical components are hugely subjective. How do you measure being good at rugby? How can you consistently assess the quality of students’ individual drama performances when comparing group performances of pieces with different dynamics? But surely it’s worth exploring alternative methods of assessing these subjects rather than trying to tighten them up so they look more like the others. If we created a tighter assessment regime for Drama, what would the subject look like? We’d have lost the essence of it in order to create a more precise exam. Or we could simply embrace the uncertainty for what it is and live with it. That doesn’t mean the subject is soft or easy; it is what it is – a subject where uncertainty is an intrinsic property.
Other Posts with related content: