The Assessment Uncertainty Principle

In our system, a great deal hangs on the value given to the assessments that lead to qualifications. As we seek to measure learning with some degree of accuracy, we risk losing contact with the meaning and nature of the learning itself. Our increasing need for measures that are reproducible, consistent and transparent diminishes our capacity to accept the inherent uncertainty in the whole enterprise. In our attempt to create a fair system where everyone knows what the standards are, we impose a framework that is rigid and linear relative to the fuzzy, dynamic nature of the processes we are measuring. In our classrooms we create clouds of learning – ephemeral, shape-shifting, finite but with indefinite boundaries – and then try to count them up like building blocks. This leads to all kinds of problems.

Here are some things to think about:

The value of marks. 

[Screenshot: extract from a GCSE Chemistry paper – five questions worth one mark each]

It’s so normal to us to set tests where marks are awarded for each question. We go through a process of allocating marks more or less on the basis that the marks are broadly equal in size, so two students gaining 14/20 on a test could be assumed to have performed equally well – or to have the same level of knowledge. But look at this GCSE Chemistry paper. There are five different ways to gain a mark. The unit size of each mark is different – there’s no meaningful way for them to be the same; concepts can’t be measured for difficulty in that way. In the question below, six marks are available for a more open-ended response:

[Screenshot: an open-ended GCSE Chemistry question]
This question is worth 6 marks.

Why 6 marks? Why would a full response to this question be more valuable than the five correct answers given above? It’s a matter of judgement from an expert paper-setter, but there is no absolute sense in which it is true. Of course, tests like GCSEs are designed to cover a range of ability levels; we need a range of easy and difficult marks so that the papers give everyone a chance to show their knowledge. Question 6 might well be worth more marks, but that would overly penalise those students who found it difficult. So marks are deliberately weighted to create gradations that allow the assessment to function as a whole. We assume that, by averaging up the scores, we smooth out the variation in difficulty and, as it’s the same for everyone, we don’t worry too much about individual mark values.

But take it to a boundary scenario – e.g. student A with 41/60 and student B with 42/60, where 42 is the grade boundary. The assumption that student B has ‘more knowledge’ or ‘knows more facts’ is highly dubious. It really depends on which questions they got right. With a different combination of equally plausible mark weightings, students A and B could reverse their scores. And yet that cliff edge will have real-life consequences for those students. There’s an element of arbitrariness in the mix that doesn’t get enough recognition in the whole process. What is a mark worth, and why? We should ask ourselves this more often.
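
To see how arbitrary this can be, here’s a minimal sketch in Python using entirely hypothetical data: the same six answers, scored under two mark schemes that both total 60 and are both individually defensible, put students A and B on opposite sides of the boundary.

```python
# Hypothetical data: six questions, two defensible mark schemes, both
# totalling 60. Each entry in a student's list is the fraction of that
# question's marks they earned (a stand-in for partial credit).

boundary = 42

scheme_1 = [10, 10, 10, 10, 10, 10]  # equal weighting per question
scheme_2 = [8, 8, 12, 12, 10, 10]    # equally plausible alternative

student_a = [0.5, 0.6, 0.9, 0.9, 0.6, 0.6]  # stronger on Q3 and Q4
student_b = [0.9, 0.9, 0.5, 0.5, 0.7, 0.7]  # stronger on Q1 and Q2

def total(fractions, weights):
    """Total mark: fraction earned on each question times its weight."""
    return round(sum(f * w for f, w in zip(fractions, weights)))

for label, weights in [("scheme 1", scheme_1), ("scheme 2", scheme_2)]:
    for name, fractions in [("A", student_a), ("B", student_b)]:
        score = total(fractions, weights)
        verdict = "clears" if score >= boundary else "misses"
        print(f"{label}: student {name} scores {score}/60 and {verdict} the boundary")

# scheme 1: A scores 41 (misses), B scores 42 (clears), as in the text.
# scheme 2: A scores 42 (clears), B scores 40 (misses): positions reversed.
```

Nothing about either student’s knowledge changes between the two runs; only the weighting does.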

There is no correct score for an essay.

Having spoken to senior people from Ofqual last year and having been through the painful post-results AQA appeal process documented here, I am now fully aware of the idea that there is no correct mark for an essay. An English essay marked out of 50 may score 33 or 34 or 35 or 36, depending on the judgement of the examiner. That’s not unusual. Why? Because it’s difficult for two people to reach precise agreement in mapping several hundred words of writing against the criteria. A judgement factor is inherent in the very concept of an essay demonstrating understanding at a certain standard. It’s not a free-for-all, but there is a legitimate level of uncertainty or tolerance. So, if a paper is re-marked, there is a good chance that the second marker will score it more highly than the first; this doesn’t mean either marker is wrong. But imagine if schools submitted all papers for second marking and demanded the highest of any marks given: it would push up everyone’s marks systematically. This is why there is a good case not to give students their papers back after exams – unless the marking discrepancies are much more significant.
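
That systematic push is easy to demonstrate. Here’s a minimal simulation with assumed numbers rather than real exam data: a notional ‘true’ mark of 34/50 and a legitimate tolerance of ±2 marks. Each marker alone is unbiased; always taking the higher of two marks is not.

```python
# Assumed numbers: 'true' mark 34/50, unbiased marker noise within +/-2.
import random

random.seed(1)
TRUE_MARK, TOLERANCE, TRIALS = 34, 2, 100_000

def one_marking():
    # An individual marker: the true mark plus unbiased judgement noise.
    return TRUE_MARK + random.randint(-TOLERANCE, TOLERANCE)

single = sum(one_marking() for _ in range(TRIALS)) / TRIALS
best_of_two = sum(max(one_marking(), one_marking()) for _ in range(TRIALS)) / TRIALS

print(f"average single mark: {single:.2f}")       # ~34.0 (unbiased)
print(f"average best-of-two: {best_of_two:.2f}")  # ~34.8 (inflated)
```

Neither marker is ever ‘wrong’ here, yet keeping the higher of two legitimate marks adds nearly a whole mark on average, for every candidate.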

Alternatively, if we want to feel that essay-based subjects are assessed with more precision, we’d need to re-think the idea of essays altogether. Why is an essay the best way to assess English Lit or History or Politics? Why not multiple-choice tests, or tests more like those in science? Isn’t it because essay-writing is an integral part of the tradition of those disciplines – the open-ended business of constructing an argument backed up with evidence? So, if we stick with essays, we need to accept the uncertainty principle and its consequences. On the other hand, if we want more precision, then perhaps essays aren’t the way to assess the learning.

Anecdote: The value of an essay grade.

My colleague Tim (@musotim) shared this story last week. A (now retired) colleague was challenged by a parent at parents’ evening: why is this essay only worth a B? The teacher looked at it and explained how he’d applied the assessment criteria, leading to a B as the best fit. The parent wasn’t happy: ‘I still think it’s worth an A.’ The teacher, very wise but slightly weary, said ‘OK’. He took the essay back, got out his pen and wrote ‘A’ at the top before handing it back. ‘There you are.’ Had this changed the quality of the essay? No. Did it change the true value of the essay? No. It was a Pyrrhic victory for the parent – who knew it all too well. Tim and I laughed a lot about that – and all the underlying implications. What is an A? How can we be sure? And who is it for? We need to think these things through.

Grade boundaries. Fuzzy clouds with cliff edges of uncertainty.

We all know that grade boundaries are created by a statistical overview of standards (all standards are derived from the bell curve, like it or not). But we’re not as up-front as we should be about the blurred edges. It feels right to use impressionistic grades to describe the standard of students’ work: that’s about an A; that’s probably an A*; that’s more like a B. Here we’re conceiving of a grade as a general range – more like a cloud. In fact, we’re happy to conceive of these standards overlapping – students rarely perform absolutely consistently at a certain standard, and any piece of work has strengths and weaknesses. But, despite our intuition for the fuzzy-edged reality of grades as standards, we’re still likely to assume that a student with an official A has definitely done better than a student with a B. We load the final grades with meaning. I’ve argued that all exams should be reported with scores. At least 49/80 and 50/80 can be seen for what they are – very close; actually pretty much identical. That change is unlikely to happen, because the consensus seems to be that the satisfyingly fuzzy breadth of A, B and C (soon to be the 1-9 scale) overrides the awkward precision of a grade boundary. I’d argue that the fuzziness around an apparently more precise score, with the freedom from any grade boundaries at all, would be a better application of the uncertainty principle. We can’t have both.
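
Here’s a rough simulation of that cliff edge, again under assumed numbers: a boundary at 50/80 and an unbiased marking tolerance of ±3 marks. Two students whose underlying scores differ by a single mark end up with their grades reversed surprisingly often.

```python
# Assumed numbers: grade boundary at 50/80, unbiased marking noise +/-3.
import random

random.seed(2)
BOUNDARY, NOISE, TRIALS = 50, 3, 100_000

def grade(underlying_score):
    # The awarded grade depends on the observed mark, not the underlying one.
    observed = underlying_score + random.randint(-NOISE, NOISE)
    return "A" if observed >= BOUNDARY else "B"

reversals = sum(grade(49) == "A" and grade(50) == "B" for _ in range(TRIALS))
print(f"49-student gets the A while the 50-student gets the B: {reversals / TRIALS:.1%}")
# ~18% of the time with these assumptions: the cliff edge, not the
# students' knowledge, decides the outcome.
```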


The illusion of linearity: KS3 Assessment and residual devotion to NC Levels.

With National Curriculum levels on the way out, schools are busy trying to work out alternatives.  So far, the signs are that there’s huge inertia and most schools will be sticking to what they know for a good while yet.

[Image: three parallel ‘level ladders’, each drawn with equal-sized steps]

There is a seductively reassuring illusion of steady progress afforded by the Level Ladder concept. We go from L5 to L6 and then L7. But we can apparently do better: recognising that each level is broad, we have sub-levels. So across the nation parents are presented with reports littered with the linear steps of 5b to 5a to 6c to 6b. The truth is that sub-levels are simply a device to create a sense of progress – they are not measures of absolute standards. We have no meaningful way of establishing that, in History for example, 5b to 5a is the same size jump as 6c to 6b. ‘Size’ in this context is a bizarre notion. And yet the steps on the left are presented as the path of progress – as if each step has the same size, not only within subjects but between them too. We’re invited to assume that 6b in Maths is much the same level as 6b in Geography (as if those things could be compared in absolute terms). In reality the middle and right-hand steps might be closer – who knows? Of course the bell curve brings things into line one way or another but, all too often, we forget that levels are not real – they are a device and nothing more. It should be a relief to schools that NC levels are going, because now we can invest energy in developing our own devices without the pretence that we’re measuring against absolute standards determined at a national level – when we’re not and never were.

If we think of levels as rough standards, exemplified through moderated pieces of work and sets of level-defining questions, then a loose set of rising clouds is a better metaphor than a set of steps. An a, b, c code could be added, but only as another loose indicator of movement. It’s a code for a general, subjective notion of progress, not a measure of steadily rising standards. The assessment uncertainty principle can allow for the former, but not the latter.

[Image: levels depicted as a loose set of rising, overlapping clouds rather than steps]

Levels of Progress

Once we apply the concept of uncertainty to levels, ‘levels of progress’ becomes very shaky as a meaningful measure. ‘Sub-levels of progress’ is almost finger-in-the-air stuff. If we want to see how much progress students have made, we should look at their work and the questions they can answer, and compare that over time. The data code for that process will only ever be a crude proxy – never an actual measure. On that basis, the way we will need to show that our students are making progress is by comparing our ‘real-work progress files’ with those from our statistical neighbour schools. I’ll show you mine if you show me yours. What do your highest-attaining Y9s do in Geography? How does it compare to what ours do? Let’s take a look. This is what working within the uncertainty principle allows: like-for-like comparison of real work is the most meaningful way to compare standards. Once we’ve opted for the data proxy, our uncertainty dial goes off into the red – Maximum Validity Skepticism. Klaxon!!

Communicating Standards to Students: Focus on the work itself.


I’ve seen teachers go to great lengths to communicate standards to students via level ladders or grade descriptors. They invest time and energy showing the students what a Level 5 might look like, involving them in assessing their own work to see whether it is Level 5. Then they explain the features of a generic Level 6 and hope that students take that on board in order to produce improved work. It’s all logical enough – but it’s so convoluted. Why not just go direct? If you can show a student how to improve their work, why does it help or matter if they can label the work with a level before and after? In my experience, the energy spent on explaining the levels is often dissipated beyond the point of being useful; teachers do it because they think they should, or because it’s something to put in their mark books. Actually, only the most sophisticated learners can use the generic criteria meaningfully to improve their work. I’d suggest we could cut out a lot of this quasi-proxy-meta assessment coding and stick to the organic, intrinsic properties of the work. The grading can come later.

Drama, PE and Science Practicals and Speaking and Listening in English.

My final thoughts are about the way we’ve been trying to address uncertainty in different examinations – leading to them being compromised. Where we’ve known that the criteria are heavily open to interpretation, we’ve tended to set out tight criteria that help students to get the best possible grades. However, over time this has led to an artificial squashing of the overall outcomes, skewed to the high end. In science practical assessments at GCSE, the practical skills and written analysis have become so formulaic that they’ve lost meaning as a way to set standards. The English Speaking and Listening assessments were similar. If you give everyone a recipe for a cake, you’re going to get a lot of decent cakes; that doesn’t mean that everyone can bake. In Drama and PE, the issue is that the practical components are hugely subjective. How do you measure being good at rugby? How can you consistently assess the quality of students’ individual drama performances when comparing group performances of pieces with different dynamics? Surely it’s worth exploring alternative methods of assessing these subjects rather than trying to tighten them up so they look more like the others. If we created a tighter assessment regime for Drama, what would the subject look like? We’d have lost the essence of it in order to create a more precise exam. Or we could simply embrace the uncertainty for what it is and live with it. That doesn’t mean the subject is soft or easy; it is what it is – a subject where uncertainty is an intrinsic property.

Other Posts with related content:

http://headguruteacher.com/2013/07/17/assessment-standards-and-the-bell-curve/

http://headguruteacher.com/2013/03/17/the-data-delusion-on-average-its-a-bit-more-complicated/

http://headguruteacher.com/2013/03/22/data-delusion-solutions-part-1/

 

28 comments

  1. Hi

    A good article, some good ideas. I agree about focusing on the work and next steps.

    The challenge is how this translates into something parents can understand.

    Also, how do you validate that what is being assessed will keep learners on the correct trajectory to achieve expected or above-expected progress?

    Do you have any thoughts?

    Sean


    • That’s a good question. At KEGS we use a simple school-defined *,1,2,3 system for that, with each department writing criteria for the expected standards. In a comprehensive school that would be more complicated but not impossible. Levels could still be used as broad milestones with some overarching ‘on track/off track’ indicator based on the quality of work. The main thing is to define expected progress against something tangible – a mark on a test with questions of a certain standard, or some samples of work, rather than a proxy code that is hard to explain.


  2. Great article, putting all those misgivings about grades + levels into words. This bit though – ‘The energy spent on explaining the levels is often dissipated beyond the point of being useful; teachers do it because they think they should or because it’s something to put in their mark books.’ – I would say most teachers do it because it’s demanded of them: their SLT expect to be able to walk into a class and have the students recite their levels on demand – and this has come from an interpretation of Ofsted demands.


    • Yes – that thing of knowing your level is one of the greatest dollops of nonsense we’ve had to swallow over the years. What level are you on? 5a. What are you aiming for? 6c. It gives the illusion of engaging students in the assessment process. If you ask them what it is about their work that needs to improve to secure the next level up, it’s a different story.


      • Nonsense if that is all we are doing – teaching them level numbers. Surely, though, good/outstanding teachers are teaching the key skills behind these numbers and explicitly explaining them to students – or, as you would say, “going direct”. The numbers are for reporting to parents and monitoring progress across subjects, cohorts and schools. Williams is spot on to tell us to leave out the numbers when assessing students’ work, but taxpayers (parents) are entitled to some form of performance indicator, in addition to good-quality formative comments on the reports home.


      • Well, it actually wasn’t quite like that. It was formulaic, true, but it was very clear for the pupils. You’re a 4a. You’ve grasped this, this and this. What do you need to do to get to a 5? THIS!


  3. Reblogged this on Carol's Learning Curve and commented:
    Heisenberg would be proud! The uncertainty principle and a bit of fuzzy logic applied to assessment… How to maintain the rich integrity of learning when grading systems strip things down to a mechanical skeleton…


  4. Assessment is vague, granted. All language is intrinsically vague too – Wittgenstein pointed this out: at what point do grains of sand become a pile of sand? What are the defining characteristics of a game? Vagueness is something we deal with all the time as humans, and mostly very successfully. However, although language is not perfectly underpinned with scientific definitions for everything we talk about, we still succeed in being very precise when we need to be. I agree that sub-levels push the accuracy of assessment beyond its limits by making the measurement of learning look more precise than it ever can be, but conversely I don’t think a “strong level 6” and a “low level 7” are meaningless concepts either. It might well be true that we need to assess less frequently, and claim less accuracy for our assessments, particularly when they are based on nothing more than instinct, but we mustn’t forget that an important part of student motivation comes from the sense of “flow” – the satisfaction of mastering harder concepts and improving their skills. You can’t do this without a feedback framework that tells you whether you have succeeded in meeting a challenge: assessment is vague, but it’s still necessary.


    • Thanks for the comment. I agree that assessment is necessary – we just need to be cautious about the absolute nature of our codes. One person’s low L7 will be another person’s strong L6 – just like the old-style A-/B+. It’s highly subjective at that level – and as long as we recognise that, we’re fine.


  5. […] The Assessment Uncertainty Principle is one of my favourite posts. We must not allow the illusion of fine-tuned assessment to be created by sub-steps and fine-grading. Assessment is fuzzy, and anything we do that suggests otherwise needs to be recognised and handled with care. Is a student on Grade 7 in our system necessarily achieving at a higher level than someone awarded Grade 6? Well, no. We’re simply projecting teachers’ judgements and offering a best guess based on our estimates to give a rough idea. Within the detail, a test score of 63% in Maths is going to be more meaningful, but actually, even then, only when we look at which questions the student got wrong and why. […]



  6. Thanks for the blog and the recognition that the system is flawed in the claims it was making. I think I would argue, as you have said, that for the student the only important questions are:

    1. What can I do?
    2. Is there evidence to show that this is a good judgement of what I can do?
    3. What do I need to do next?

    The kerfuffle came about because we then became obsessed with a number of things:

    1. The idea that assessment data was a reliable and accurate measure of teachers’ and/or schools’ performance
    2. The idea that there was a problem with too many people doing well
    3. The ongoing confusion between normative and grade-related criteria.

