Here’s my annual twitter joke for GCSE results day:
DfE media guidance. Response to GCSE outcomes:
- If results have gone up: This is evidence of policy raising standards.
- If results have gone down: This is evidence of policy raising standards.
It’s a clever trick – but it’s not actually wrong. A case could be made for both statements being true. We can talk about ‘raising standards’ to mean setting higher benchmarks for the standards we expect, or we can talk about the standards that have actually been met – but they might not be the same thing at all. Nobody is better at maths simply because they sat harder exams – especially if they failed them.
I am a supporter of the exam and curriculum reforms in general but the truth is that our education system is permeated by confusion (and a fair degree of ignorance) about what standards are, how they are measured and what GCSE grades tell us.
As the table above shows, in broad terms, despite the rhetoric around tough new GCSEs – and the real experience of teaching the new tougher curriculum and of students sitting the tougher exams – the grade award patterns between the two systems are very similar. There is a slight dip in top grades and a marginal drop in the A*-C/4-9 pass. This doesn’t look like the scale of dip you might expect if the grades represented fixed standards. There is a mismatch between the rhetoric/reality around the new toughness and the apparently similar distribution of grades.
Evidence of the toughness comes from the marks. You needed only 52% on Higher Maths to gain a Grade 7 – directly equivalent to an old A. That seems very low – meaning that the exam was hard. So, with this year’s results, we now have a very clear indication that grades do not tell us about the relative difficulty of the assessment, the standards in an absolute sense – the challenge of the material being studied. Primarily they mark positions on the bell curve. The test was harder but there are still always 50% of students in the top half! It’s not literally that simple but it’s close. The need to ensure that this year’s students were not disadvantaged by changes – the fairness principle – has driven a very strong process to keep the bell curve markers in the same place. But that makes it pretty confusing.
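To see how an anchored bell curve works in practice, here is a minimal Python sketch of comparable-outcomes-style boundary setting. The grade shares and the two mark distributions are entirely made up for illustration – they are not Ofqual’s actual figures or method – but the mechanism is the point: fix the proportions first, then let the boundaries fall wherever they must.

```python
import random

# Hypothetical shares of the cohort at or above each grade
# (illustrative only -- not Ofqual's real cumulative percentages).
GRADE_SHARES = {9: 0.04, 8: 0.10, 7: 0.20, 6: 0.35, 5: 0.50, 4: 0.70}

def set_boundaries(raw_marks, shares=GRADE_SHARES):
    """Place grade boundaries so that fixed proportions of the cohort
    land at or above each grade, whatever the paper's difficulty."""
    marks = sorted(raw_marks, reverse=True)  # highest first
    n = len(marks)
    # Boundary = lowest mark still inside the top 'share' of candidates.
    return {grade: marks[max(round(share * n) - 1, 0)]
            for grade, share in shares.items()}

# A harder paper just shifts every boundary down; the same
# fraction of students still ends up with each grade.
random.seed(0)
easy = [min(max(random.gauss(60, 15), 0), 100) for _ in range(5000)]
hard = [min(max(random.gauss(45, 15), 0), 100) for _ in range(5000)]
print(set_boundaries(easy))
print(set_boundaries(hard))
```

Run it and every boundary on the “hard” paper sits well below its “easy” counterpart, yet the grade distributions are identical – which is exactly why a 52% Grade 7 boundary tells you about the paper, not the students.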
So here is the big question for me: After this year, how will we know if standards of student performance in English and Maths have gone up? It is likely that the grading machinery will keep things very stable. Should we look to see if marks on similar papers increase, or will we just associate higher raw marks with the exams having been made easier? The same merry-go-round. The DfE and Ofqual will need to communicate this well. Are we going to allow some grade inflation to creep back in so that we can all believe standards of student performance are rising? Will the reference testing process be given more publicity so we all know how it feeds into the whole machine?
Competing demands on measurements
It’s an interesting challenge to understand the examination grading process; it’s an even bigger one to communicate it to students and parents. We want educational outcomes to be measurable so we know how well children have done. But we’re so deeply conflicted about whether this means recording what they know and can do – or measuring how well they did compared to everyone else.
These charts are familiar to any parent. They contain both types of information: absolute and comparative. It’s more meaningful to tell a parent that their child is at the 50th percentile than it is to say, nothing to worry about, Ms Smith, your child’s length is 65 centimetres. Why? Because 65 cm could be either very high or very low depending on the child’s exact age; it’s only the comparative measure that is useful. Educational measurements are very similar.
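The growth-chart logic is easy to sketch: convert an absolute measurement into a percentile against a reference distribution for the child’s age. The means and standard deviations below are invented for illustration – they are not real WHO chart values – but they show how the same 65 cm can be reassuring or alarming:

```python
from math import erf, sqrt

def percentile(value, mean, sd):
    """Percentile of a measurement within a normal reference
    distribution -- the logic behind a growth chart."""
    z = (value - mean) / sd
    return 50 * (1 + erf(z / sqrt(2)))

# Hypothetical reference values, purely illustrative.
# For a younger baby (mean 67 cm), 65 cm is unremarkable;
# against an older cohort (mean 75 cm), the same 65 cm is way off the curve.
print(round(percentile(65, mean=67, sd=2.3), 1))
print(round(percentile(65, mean=75, sd=2.7), 1))
```

The absolute number (65 cm) never changes; only the comparison does – which is precisely the relationship between a raw exam mark and a grade.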
As I’ve explored in various posts – including Assessment, Standards and The Bell Curve – there are very few educational measures that are truly absolute: we almost always need to refer to the bell-curve if we want to base our ideas about standards around the notion of difficulty.
One of our problems is the widely held sense that any awards we give are cheapened if too many people get them; if they appear freely available. We want everyone to do well, but as soon as too many people start to succeed, the shout isn’t one of celebration – it’s hand-wringing about standards being watered down. This isn’t necessarily unfounded. The extent of grade inflation in the 90s and 00s was certainly not matched by a parallel rise in the fundamental levels of educational performance of our young people. Separating real improvement from inflationary improvement is a core technical issue for examination systems everywhere. (Reference tests are part of the solution – these are happening in the background but we don’t get told much about them.)
Another problem is our strong commitment to the idea of passing or failing exams. Discourse around educational standards is littered with references to the pass/fail threshold. For decades the C grade has represented this pass and there has been a right old farce surrounding the Good Pass/Standard Pass issue in the new system. There is no intrinsic reason for any grade to be a pass, especially given that, in a norm-referenced system, not everyone can achieve a 4 or above. Nicky Morgan snatched defeat from victory (stupidity from sense) when she immediately branded 5-9 as Good Passes in the new system. This was our chance to say: every grade counts; each grade represents learning to different standards that have value for what they are. It would have been perfectly possible to set Grade 5 as a benchmark level for looking at school improvement or a sensible baseline standard for moving onto A levels without needing to call it a pass (and the others a fail).
This is like making every beginner piano student take Grade 5 – so that large numbers fail – instead of setting standards at different levels that students can pass. If only our examination system were like that. Too late now… for a decade at least.
The Zero Sum Effect
Another important issue to wrestle with is one I raised in the Bell-Curve Cage post. Despite the rhetoric around the new toughness and the strongly anchored bell-curve, we are still expected to swallow the delusion that all schools can show improvement. As the graph below illustrates explicitly, there are winners and losers. But still we are not allowed to talk in terms of a zero-sum; that’s tinfoil-hat territory. Yet every school that gets a results boost can only do so at the expense of another. That’s what a stable system looks like. And still the fierce, outrageous high-stakes accountability system takes schools to pieces for a decline in standards. How loud do we have to shout it: WE CAN’T ALL BE ABOVE AVERAGE. WE CAN’T ALL GET BETTER GRADES.
The Challenge of Tiered Exams
After some of the PiXL experiments with trial papers and boundaries, I was interested to see how this panned out in reality:
It’s complicated, isn’t it? To my mind, the outcomes of the tiering and the Grade 4 thresholds suggest that two tiers isn’t a good model for this range of performance. I’m not suggesting we can change it; it’s too late for alternative models. We just need to learn to deal with the fall-out. The educational experience of students who end up gaining less than 20% of the marks in an exam is brutal – never mind the examination experience itself. I don’t think there is a sound basis to say that a student knows X and can do Y in maths – representing a Grade 4 ‘Pass’ – when they will have got 15 out of 80 marks on a paper (to say nothing of the horrible cliff-edge whereby 14 out of 80 is definitely below a pass).
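The cliff-edge is just a hard threshold function. A tiny sketch – the Foundation-tier boundaries here are made up, merely echoing the roughly 15-out-of-80 Grade 4 mark discussed above:

```python
def grade(mark, boundaries):
    """Map a raw mark to a grade via fixed thresholds.
    One mark either side of a boundary flips the outcome."""
    for g, cut in sorted(boundaries.items(), reverse=True):
        if mark >= cut:
            return g
    return 0  # ungraded

# Illustrative Foundation-tier boundaries out of 80 (invented).
foundation = {5: 35, 4: 15, 3: 8}

print(grade(15, foundation))  # → 4: a 'pass'
print(grade(14, foundation))  # → 3: one mark away, a 'fail'
```

A single mark – well inside the measurement error of any exam – is the whole difference between the two labels.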
The challenge is to create assessments that tackle the spread and allow us to measure across the whole range of outcomes – but exams are not like rulers where you can simply read off the heights.
The two-tier approach is a bit like trying to streamline the ABRSM piano exams into two levels of exam – say the Grade 3 pieces and the Grade 6 pieces. Instead of beginners playing Grade 1 pieces and passing, they would do the Grade 3 exam, find it very difficult and get a score that is equated to a Grade 1. Similarly, the top end students don’t take Grade 8, they take Grade 6 and if they absolutely ace it, they are awarded Grade 8; if they struggle with the pieces but make at least some sense of them, they might get a Grade 4. There’s then a messy overlap: doing well on Grade 3 or badly on Grade 6.
Personally, I think it would be better if exams had Core and Extension papers – or if the Higher and Foundation marks were morphed into a single scaled score. In general, I think we should move to scaled scores of say 1-100 instead of grades. The cliff edges we live with are unacceptable really. Sadly there’s no chance of that ever happening.
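For what it’s worth, the single-scale idea could work something like this. The sketch below uses a hypothetical linear equating – the 1-60 and 40-100 spans and their overlap region are my assumptions for illustration, not any exam board’s method:

```python
def scaled_score(raw, max_raw, tier):
    """Map a raw mark from either tier onto a single 1-100 scale.
    Hypothetical linear equating: Foundation spans 1-60 of the scale,
    Higher spans 40-100, with a shared 40-60 overlap region."""
    frac = raw / max_raw
    lo, hi = (1, 60) if tier == "foundation" else (40, 100)
    return round(lo + frac * (hi - lo))

# A strong Foundation script and a weak Higher script both land
# in the shared overlap, rather than either side of a grade cliff.
print(scaled_score(72, 80, "foundation"))  # → 54
print(scaled_score(20, 80, "higher"))      # → 55
```

On a scale like this, one extra raw mark moves a student one or two scale points – not from ‘pass’ to ‘fail’.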
The most important thing now is for there to be a moratorium on change. Arguably the process of implementing all the reforms over the last five years, whilst potentially allowing standards to rise in the future, has prevented us from actually raising standards overall so far. There is no evidence that our education system as a whole is delivering higher standards now in 2017 than we had in 2010. Part of the reason for that is that too much emphasis is placed on pieces of the system – school structures, a few celebrated new schools, a few top-end success stories, the cult of Outstanding etc. We have not been sufficiently focused on how to make students better at maths, better at English, better at science. Better in the sense that they know more; truly better in absolute terms.
This is about developing leaders, developing teachers and creating the conditions that motivate people to stay in the job long enough to make a sustained impact. I think the assessment paradigm shift I discuss here is part of it too. Let’s see if that’s the emphasis now the reform storm has nearly passed.