Increasingly I am becoming frustrated by the lack of sophistication that is applied to the whole process of evaluating educational outcomes. As a consequence, all kinds of perverse and spurious conclusions are drawn and schools, teachers and policy makers end up jumping through hoops that have no real basis. If we’re not careful, we’re going to lose sight of what matters… if we haven’t done so already.
I will try to illustrate the point… always conscious that inevitably I will be over-simplifying, so please bear that in mind.
There are two major issues with the measurement of educational outcomes:
- The things we are measuring – knowledge, skills and understanding – are, for the most part, intangible, ephemeral and invisible; brains are very complicated and we don’t really understand them. As Dylan Wiliam is fond of saying: “Learning isn’t rocket science; it is much more complicated than that”. We often fall into the trap of assuming that our measurements capture the extent of learning, thus limiting our view of what learning is to that which is measurable. We are simply not good (on average!) at dealing with learning that is beyond the scope of our measurement tools.
- Everything we do in schools, everything that constitutes learning, is subject to our values system. What we value in terms of learning outcomes is not absolute and, as humans with different world views living in a democracy, we have to work hard to arrive at a consensus about what matters. Pythagoras’ theorem may be one of the universal truths of space but whether it matters is something we decide; knowing it and being able to use it are different and, again, subject to our values.
So, in this context of complexity, naturally enough, we try to create order. It is sensible enough to agree on a curriculum defining things that should be known and understood (whether the Government should decide this or not is another issue.) On the micro-scale of simple questions, it is meaningful to assess learning against the curriculum objectives:
- An understanding that momentum is conserved in collisions
- The ability to spell ‘disaggregate’ correctly
- The ability to write the symbol or word equation for Photosynthesis and use it to explain various features of plant growth.
- An awareness of the key events of World War II
- The ability to understand “Fortiter ex animo” or “Wir könnten Bowlen gehen”
More subtly, we can assess complex accretions of learning:
- The quality of writing in an essay; whether it is coherent and makes a good argument
- The success of an evaluation of multiple causal factors and their inter-relationship in determining the key reasons for a certain outcome (in any subject)
- The skill and originality in producing a composition.
From questions to basic tests and assessment tasks right up to long exams and extended pieces of work, we do what we can to make sense of what has been learned and to give it value. This is nuanced; every answer to ‘why does your heart beat faster during exercise?’ is different – even if there is an objective truth. Try it! My Y10 daughter complained the other day that in English Literature ‘they say there is no right answer…. but there always is!’ The truth-values interplay is an everyday experience for learners and teachers.
However – and here is the point at which we start to lose meaning – in seeking to capture the essence of our assessment in order to communicate and record it, we continually attempt to make something complicated very simple, and we turn real meaning into a code: data. Superficially this is innocuous but ultimately, unless we’re very careful, all manner of distortions arise. Another Dylan Wiliam quote: “A man with one foot in boiling water and another in freezing water is not, on average, comfortable”.
14/20 or 65% on a maths test: Already, we’re losing sight of what was learned, settling for an overview. Two students can get the same score – with completely different wrong answers; same score – entirely different learning. Obviously 80% on a ‘hard test’ is better than 80% on an ‘easy test’… so we compare with others; what starts as a record of success in gaining correct answers, becomes a statement of relative performance, leading to the bell-curve approach. In truth a lot of assessment is relative; not absolute.
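To make the "same score, different learning" point concrete, here is a minimal sketch in Python. The two answer patterns are invented purely for illustration; they are assumptions, not data from any real test.

```python
# Hypothetical 20-question test: True = correct, False = incorrect.
# Both answer patterns are made up for illustration only.
student_a = [True] * 14 + [False] * 6   # gets the first 14 right, misses the last 6
student_b = [False] * 6 + [True] * 14   # misses the first 6, gets the last 14 right


def score(answers):
    """Return the number of correct answers."""
    return sum(answers)


def shared_correct(a, b):
    """Count the questions that both students answered correctly."""
    return sum(x and y for x, y in zip(a, b))


print(score(student_a), score(student_b))     # both students score 14 out of 20
print(shared_correct(student_a, student_b))   # only 8 questions were answered correctly by both
```

Both students would be recorded as 14/20, yet each has secured six questions' worth of learning that the other has not. The single number hides exactly the detail a teacher needs.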
4/6 for a question or 23/30 for an essay: Turning a set of ideas into words is a messy process; ascribing a scale to that is messier – so we need criteria. A 6 mark answer is hard enough to define relative to 5 marks; 23/30 is basically meaningless unless we can separate 23 from 22 or 24 with some consistency. Moderation meetings for essay-based subjects are interesting! AQA markers for English A level papers can differ by 30 marks out of 80 in their assessments. What does it all mean??
Level 6c on a single piece of work or a Y8 report. National Curriculum levels were designed and defined as a set of attainment statements related to a whole key stage. In science, how magnesium reacts with oxygen without mass being lost is a piece of knowledge a student might learn. There is no sense of any kind in which this can be ascribed a level on a par with some other bit of knowledge. None. The assumption that the depth of learning goes up in linear steps or that the steps are of equal size within one subject is a fabrication to create the illusion of progress over time. To assume that the levels have parity across subject disciplines is also pure delusion. It is literally without meaning. And yet… Y8s across the country are being told they are at Level 5a and need to progress to 6c. We’ve made it all up. It just means – learn more; go deeper; express it in a more sophisticated manner. The level ladders are a super-crude code that is divorced from real learning, where the biggest variable by far is the teacher’s interpretation. At my school we’ve devised our own system that makes sense for us; we didn’t feel we could play the levels game.
70% making ‘three levels progress’: Despite the house of cards of Levels, ‘Levels of progress’ has become a key OfSTED measure. It may well be that moving from L3 to L5 is harder than L4 to L7 in some areas of learning; factor in the assessment error and you have a measure from one massive averaged uncertainty to another massive averaged uncertainty. A statement like ‘75% of students made 3 levels progress’ tells you almost nothing about the learning that has taken place or how good the school is. We are projecting meaning on to something that isn’t there… we really are.
Grade B on a piece of work: an essay, a painting, a science investigation; in an exam. I had a discussion with a teacher about why he gave B+/A- for essays. In his head, this was consistent… a B+ is definitely not an A- and he would give the same grades consistently. I have every reason to be very dubious about this… Grading is very clearly not an objective, absolute process. Grading only has meaning in reference to the cohort – the dreaded norm referencing. If you think you can define a grade with some criteria that can be tested accurately, you’re doing better than any exam board and most teachers. I’ve devised tons of tests. We give scores and %s and then allocate grades. How? By seeing how the mean and ranges of scores compare with other data sets. Test scores might range from 10/100 to 90/100. Here you can see that dividing up into grade regions might work. But, when the range is 65/100 to 75/100… how meaningful is it to say the students performed at different enough levels to warrant different grades? Well, this happens all the time. A recent GCSE PE exam at my school gave A-D grades for scores from 62/80 down to 56/80.
The grade boundary cliff is another issue. For any exam, in UMS terms, 70 might be an A, and 69 a B; here the difference in learning is marginal – zero in all reality within the limits of accuracy – but the boundary wall gives massive unfounded significance to a hair’s breadth on an artificial scale. And look what happened to English GCSEs last summer. Catastrophic gerrymandering – bursting the bubble of hope people had created that grades were about objective standards… instead of rank order. Well now we all know.
All the stuff about ‘working-at grades’ is also highly dubious. In GCSE Physics, my subject, there is no sense in which a student starts at grade C, moves up to B and eventually reaches A. The grades are based on norm-referenced bell curve analysis of overall grades in final exams. At no point is there a C grade until the end… All I can do is evaluate whether they are on the path towards an A…but they are never at C or B. Again, this is artificial and needs to be seen as such.
3As, 5Bs and 2Cs for a student: Next, we aggregate all of this up: we turn scores into UMS into Grades for various exams, giving a student an overall set of grades. What does this tell you about what they learned or what they can do? Very little – except in reference to how the testing system, with all its statistical distortion factors and errors, compared them to everyone else. I think it is ironic that content/knowledge purists also often advocate a testing regime that produces an output that actually only gives you a general overall sense of what a student’s general learning capabilities might be within the specific parameters of a test; i.e. it doesn’t tell you anything about what they know or can do.
60% 5A*-C for a whole school. Continuing up the chain, we end up defining an entire school – all learners, all learning, everything… in a single piece of data. 60% 5A*-C averages out everything we know about learning to the point of oblivion. A school where 60% of students got exactly 5Cs looks similar to one with 60% of students gaining a mixture of A*-Cs including lots of A*s from a mixed intake. Most recently, OfSTED squeezed every last drop of meaning out of the whole edifice by putting schools into banks of ‘similar schools’ and ranking them into Quintiles on the Data Dashboard. Here schools with broadly similar %5A*-C scores (a few % apart) can be ‘top quintile’ and ‘bottom quintile’. Here, the plot has been utterly lost. There is no reproducible, meaningful sense in which schools’ outcomes can be processed in this way and convey a sense of the quality of learning or the overall educational experience.
A value added score of 986.7 or 1016.3. In science, we teach students about measurement: accuracy, precision, resolution, reproducibility… and so on. It is standard practice to evaluate errors in measurement and to take care not to over-state the precision in a final result, relative to the size of errors. For example, if your stop-clock only measures in seconds you can’t say the time for a feather to drop is 8.63 seconds. If you measure 100 drops, the calculator may tell you the average is 8.63 seconds but your apparatus is not up to the job of giving you that level of precision. If the reaction time suggests an error of 1 second, the best you could hope for is an answer of, say, 9 seconds +/- 1 second. But… do the DFE and OfSTED understand this? No they do not. The VA algorithm is deeply flawed. School A: VA = 995.2 +/- 11.6; School B: VA = 1002.1 +/- 13.7 (not uncommon). We are expected to believe School B adds more value than School A… but the errors suggest we cannot make that claim. Data garbage presented as truth.
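The School A and School B comparison can be checked in a few lines of Python. This is only a rough sketch: it assumes each "+/-" figure can be treated as a simple range around the score, and it uses overlap of those ranges as an informal check, not as a formal significance test. The function names are my own, for illustration.

```python
def interval(value, error):
    """Return the (low, high) range implied by value +/- error."""
    return (value - error, value + error)


def intervals_overlap(a, b):
    """True if two (low, high) ranges share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# VA figures quoted in the text above.
school_a = interval(995.2, 11.6)    # roughly 983.6 to 1006.8
school_b = interval(1002.1, 13.7)   # roughly 988.4 to 1015.8

gap = 1002.1 - 995.2                # difference between the headline scores, about 6.9

print(intervals_overlap(school_a, school_b))  # True: the ranges overlap
print(gap < 11.6 and gap < 13.7)              # True: the gap is smaller than either quoted error
```

The difference between the two headline scores is smaller than either school's own quoted error, so on these figures the data cannot tell us which school "adds more value".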
Effect Sizes of 0.29, 0.65 and 0.84. My final bit of Data Delusion is the new and growing search for reproducible and reliable educational outcomes from research. Hattie and Petty, amongst others, have done work in this area and, for me, the outcomes are interesting. Over many studies, the rank order in effect sizes (derived from standard deviation calculations) leads to a set of high-impact strategies that ring true for me with my subjective bias and values. I should be really pleased. But, unfortunately, there is already an overwhelming tendency for people to take these figures at face value. To begin with, the figures are averages. If an effect size is 0.65, no single study may have yielded that outcome; the range may have been 0.1 to 1.2 as the contexts shifted and changed. Then, the level of precision implied by the second decimal place gives the impression that a 0.65 effect is somehow meaningfully higher than a 0.62 effect – and I have heard people take the surface rank order as gospel truth. Of course, these are historical, retrospective averages. They tell you nothing about what might happen in any specific context beyond this: some initiatives, on average, in the past, have been shown statistically, in the particular tests that were done, to exhibit this general pattern. It might, therefore, be worth looking at these strategies to see if they also work well in your context.
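A small Python sketch shows how an average effect size can conceal the spread behind it. The six per-study values below are invented for illustration (chosen so that they span 0.1 to 1.2 and average out at about 0.65); they are not taken from Hattie, Petty or any real meta-analysis.

```python
from statistics import mean

# Hypothetical effect sizes from six imaginary studies of one strategy.
# These numbers are made up purely to illustrate the argument.
study_effects = [0.1, 0.4, 0.6, 0.7, 0.9, 1.2]

average = mean(study_effects)
lowest, highest = min(study_effects), max(study_effects)

print(round(average, 2))   # the headline figure, about 0.65
print(lowest, highest)     # the spread behind it: 0.1 to 1.2

# No individual study actually produced the headline figure.
print(any(abs(e - average) < 0.005 for e in study_effects))  # False
```

Quoting the headline figure to two decimal places suggests a precision that a spread like this simply doesn't support; a difference of 0.03 between two strategies would be lost many times over inside the range.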
As I have shared in this post – what Hattie says about homework is complex. And yet, even intelligent people will tell me that 0.29 is a low effect size, therefore homework is a bad strategy. It makes me weep…..
What we need is an intelligent view of assessment that takes account of the distortions inherent in any measurement process; that is capable of embracing the idea of ‘error’ and that does more to link our assessments to the original learning. Teachers, leaders, inspectors and politicians need to avoid placing high value on data in ways that cannot be sustained. Learning is fuzzy; it is complex… let’s embrace that and not reduce it to something where all meaning has been lost. It is like dropping a bag of marbles. Even when we know, in physics terms, all the laws that determine the motion of a dropped marble, there are so many variables at play that we cannot predict how the marbles will fall and roll. We can see a pattern, we can look at limits… but we can’t describe the detail. If we look at the final resting places of a bag of dropped marbles, similarly, we can’t extrapolate backward to know exactly how they got there. If this is true for a simple bag of marbles… for learning, it is even more complex. Let’s recognise that.
To finish: An unrepresentative anecdotal cautionary note: Another dodgy data dimension is around the process of judging lessons and schools through OfSTED inspections and lesson observations. I know an inspector who is quite happy to tell teachers their lesson is Good (not Outstanding)… because of the quality of learning he observes and the levels of progress made. This man is a creationist; he thinks living things were placed on Earth by a higher being and says ‘I don’t agree with all that evolution stuff’. Never mind the evidence. If he comes to your school, you’re in serious trouble!