The Data Delusion: On average, it’s a bit more complicated.

With apologies to Richard Dawkins…

Increasingly I am becoming frustrated by the lack of sophistication applied to the whole process of evaluating educational outcomes.  As a consequence, all kinds of perverse and spurious conclusions are drawn, and schools, teachers and policy makers end up jumping through hoops that have no real basis.  If we’re not careful, we’re going to lose sight of what matters… if we haven’t done so already.

I will try to illustrate the point… always conscious that inevitably I will be over-simplifying, so please bear that in mind.

There are two major issues with the measurement of educational outcomes:

  1. The things we are measuring – knowledge, skills and understanding – are, for the most part, intangible, ephemeral and invisible; brains are very complicated and we don’t really understand them.  As Dylan Wiliam is fond of saying: “Learning isn’t rocket science; it is much more complicated than that.”  We often fall into the trap of assuming that our measurements capture the full extent of learning, thus limiting our view of what learning is to that which is measurable.  We are simply not good (on average!) at dealing with learning that is beyond the scope of our measurement tools.
  2. Everything we do in schools, everything that constitutes learning, is subject to our value system.  What we value in terms of learning outcomes is not absolute and, as humans with different world views living in a democracy, we have to work hard to arrive at a consensus about what matters.  Pythagoras’ theorem may be one of the universal truths of space, but whether it matters is something we decide; knowing it and being able to use it are different things and, again, subject to our values.

So, in this context of complexity, naturally enough, we try to create order.  It is sensible enough to agree on a curriculum defining things that should be known and understood (whether the Government should decide this is another issue).  On the micro-scale of simple questions, it is meaningful to assess learning against curriculum objectives:

  • An understanding that momentum is conserved in collisions
  • The ability to spell ‘disaggregate’ correctly
  • The ability to write the symbol or word equation for photosynthesis and use it to explain various features of plant growth.
  • An awareness of the key events of World War II
  • The ability to understand “Fortiter ex animo” or “Wir könnten Bowlen gehen”

More subtly, we can assess complex accretions of learning:

  • The quality of writing in an essay; whether it is coherent and makes a good argument
  • The success of an evaluation of multiple causal factors and their inter-relationship in determining the key reasons for a certain outcome (in any subject)
  • The skill and originality in producing a composition.

From questions to basic tests and assessment tasks right up to long exams and extended pieces of work, we do what we can to make sense of what has been learned and to give it value.  This is nuanced; every answer to ‘why does your heart beat faster during exercise?’ is different – even if there is an objective truth.  Try it! My Y10 daughter complained the other day that in English Literature ‘they say there is no right answer… but there always is!’  The truth-values interplay is an everyday experience for learners and teachers.

However – and here is the point at which we start to lose meaning – in seeking to capture the essence of our assessment in order to communicate and record it, we continually attempt to make something complicated very simple, and we turn real meaning into a code: data.  Superficially this is innocuous but ultimately, unless we’re very careful, all manner of distortions arise.  Another Dylan Wiliam quote: “A man with one foot in boiling water and another in freezing water is not, on average, comfortable.”

14/20 or 65% on a maths test:  Already, we’re losing sight of what was learned, settling for an overview.  Two students can get the same score with completely different wrong answers; same score, entirely different learning.  Obviously 80% on a ‘hard test’ is better than 80% on an ‘easy test’… so we compare with others; what starts as a record of success in gaining correct answers becomes a statement of relative performance, leading to the bell-curve approach.  In truth, a lot of assessment is relative, not absolute.
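
To make the point concrete, here is a tiny sketch with entirely invented data: two students who both score 14/20 on the same test, with wrong answers that don’t overlap at all.

```python
# Entirely invented data: two students both score 14/20 on the same maths
# test, but the questions they got wrong have nothing in common. The data
# point is identical; the learning it summarises is not.
questions = set(range(1, 21))

wrong_a = {3, 7, 8, 11, 15, 19}     # questions student A answered incorrectly
wrong_b = {1, 2, 5, 12, 16, 20}     # questions student B answered incorrectly

score_a = len(questions) - len(wrong_a)   # 14
score_b = len(questions) - len(wrong_b)   # 14

print(score_a, score_b)    # 14 14 -- the record says they are the same
print(wrong_a & wrong_b)   # set() -- not a single wrong answer in common
```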

4/6 for a question or 23/30 for an essay: Turning a set of ideas into words is a messy process; ascribing a scale to that is messier – so we need criteria.  A 6-mark answer is hard enough to define relative to a 5-mark one; 23/30 is basically meaningless unless we can separate 23 from 22 or 24 with some consistency.  Moderation meetings for essay-based subjects are interesting!  AQA markers for English A level papers can differ by 30 marks out of 80 in their assessments.  What does it all mean?

Level 6c on a single piece of work or a Y8 report:  National Curriculum levels were designed and defined as sets of attainment statements relating to a whole key stage.  In science, the fact that mass is conserved when magnesium reacts with oxygen is a piece of knowledge a student might learn.  There is no sense of any kind in which this can be ascribed a level on a par with some other bit of knowledge. None.  The assumption that the depth of learning goes up in linear steps, or that the steps are of equal size within one subject, is a fabrication to create the illusion of progress over time.  To assume that the levels have parity across subject disciplines is also pure delusion.  It is literally without meaning.  And yet Y8s across the country are being told they are at Level 5a and need to progress to 6c.  We’ve made it all up.  It just means: learn more; go deeper; express it in a more sophisticated manner.  The level ladders are a super-crude code, divorced from real learning, in which the biggest variable by far is the teacher’s interpretation.  At my school we’ve devised our own system that makes sense for us; we didn’t feel we could play the levels game.

70% making ‘three levels progress’:  Despite the house of cards of Levels, ‘levels of progress’ has become a key OfSTED measure.  It may well be that moving from L3 to L5 is harder than L4 to L7 in some areas of learning; factor in the assessment error and you have a measure from one massive averaged uncertainty to another massive averaged uncertainty.  A statement like ‘75% of students made 3 levels progress’ tells you almost nothing about the learning that has taken place or how good the school is.  We are projecting meaning on to something that isn’t there… we really are.
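
A rough sketch of how the uncertainty compounds (the ±0.7 of a level is my own assumption for illustration, not an official figure): the ‘progress’ number is the difference between two uncertain assessments, so it inherits the combined error of both.

```python
import math

# The +/- 0.7 of a level is assumed for the sake of the sketch.
start_level, start_err = 4.0, 0.7    # assessed at L4 on entry
end_level, end_err = 7.0, 0.7        # assessed at L7 at the end of the key stage

progress = end_level - start_level
# Independent errors combine in quadrature when you take a difference.
progress_err = math.sqrt(start_err ** 2 + end_err ** 2)

print(f"progress = {progress:.1f} +/- {progress_err:.1f} levels")
# ~ 3.0 +/- 1.0: the headline 'three levels of progress' could easily be 2 or 4.
```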

Norm-referencing. Like it or not, this is what grades mean.

Grade B on a piece of work: an essay, a painting, a science investigation; in an exam.  I had a discussion with a teacher about why he gave B+/A- for essays.  In his head, this was consistent: a B+ is definitely not an A-, and he would give the same grades consistently.  I have every reason to be very dubious about this.  Grading is very clearly not an objective, absolute process.  Grading only has meaning in reference to the cohort – the dreaded norm-referencing.  If you think you can define a grade with criteria that can be tested accurately, you’re doing better than any exam board and most teachers.  I’ve devised plenty of tests.  We give scores and percentages and then allocate grades.  How?  By seeing how the mean and range of scores compare with other data sets.  Test scores might range from 10/100 to 90/100; here you can see that dividing up into grade regions might work.  But when the range is 65/100 to 75/100, how meaningful is it to say the students performed at different enough levels to warrant different grades?  Well, this happens all the time.  A recent GCSE PE exam at my school gave A-D grades for scores from 62/80 down to 56/80.
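
Here is a crude sketch of what norm-referencing amounts to (the quartile cut points are my own invention, purely for illustration): grades are handed out by rank within the cohort, so when everyone’s scores are bunched together, one or two marks separate a D from an A.

```python
def norm_reference(scores, boundaries=(0.25, 0.50, 0.75)):
    """Hand out grades purely by rank within the cohort (quartile cuts here).
    The cut points are invented; real awarding is more elaborate, but the
    principle -- a grade describes relative position -- is the same."""
    ranked = sorted(scores)
    cuts = [ranked[int(len(ranked) * b)] for b in boundaries]

    def grade(score):
        if score >= cuts[2]:
            return "A"
        if score >= cuts[1]:
            return "B"
        if score >= cuts[0]:
            return "C"
        return "D"

    return [(score, grade(score)) for score in scores]

# Wide spread: the grade bands separate genuinely different performances.
print(norm_reference([10, 25, 38, 52, 61, 70, 78, 85, 90]))
# Narrow spread: everyone scores between 65 and 75, yet ranking still
# produces A to D, so a couple of marks decide the grade.
print(norm_reference([65, 66, 68, 69, 70, 71, 72, 74, 75]))
```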

The grade boundary cliff is another issue.  For any exam, in UMS terms, 70 might be an A and 69 a B; here the difference in learning is marginal – zero in all reality, within the limits of accuracy – but the boundary wall gives massive unfounded significance to a hair’s breadth on an artificial scale.  And look what happened to English GCSEs last summer.  Catastrophic gerrymandering – bursting the bubble of hope people had created that grades were about objective standards instead of rank order.  Well, now we all know.
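
A minimal sketch of the cliff, using illustrative figures (not any board’s real boundaries, and the ±3 marking error is assumed):

```python
# Illustrative numbers only: 70 UMS is an A, 69 is a B, but a plausible
# marking uncertainty of +/- 3 marks makes the two candidates indistinguishable.
A_BOUNDARY = 70
MARKING_ERROR = 3   # assumed for the sake of the sketch

def grade(ums):
    return "A" if ums >= A_BOUNDARY else "B"

for ums in (69, 70):
    low, high = ums - MARKING_ERROR, ums + MARKING_ERROR
    print(f"{ums} UMS -> {grade(ums)}, but could plausibly be anywhere in {low}-{high}")
```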

All the stuff about ‘working-at grades’ is also highly dubious.  In GCSE Physics, my subject, there is no sense in which a student starts at grade C, moves up to B and eventually reaches A.  The grades are based on norm-referenced, bell-curve analysis of overall results in final exams.  At no point is there a C grade until the end.  All I can do is evaluate whether they are on the path towards an A; they are never ‘at’ C or B.  Again, this is artificial and needs to be seen as such.

3As, 5Bs and 2Cs for a student:  Next, we aggregate all of this up: we turn scores into UMS, and UMS into grades for various exams, giving a student an overall set of grades.  What does this tell you about what they learned or what they can do?  Very little – except in reference to how the testing system, with all its statistical distortion factors and errors, compared them to everyone else.  I think it is ironic that content/knowledge purists often also advocate a testing regime whose output actually only gives you a general sense of what a student’s learning capabilities might be within the specific parameters of a test; it doesn’t tell you anything about what they know or can do.

60% 5A*-C for a whole school:  Continuing up the chain, we end up defining an entire school – all learners, all learning, everything – in a single piece of data.  60% 5A*-C averages out everything we know about learning to the point of oblivion.  A school where 60% of students got exactly 5 Cs looks similar to one where 60% of students gained a mixture of A*-Cs, including lots of A*s, from a mixed intake.  Most recently, OfSTED squeezed every last drop of meaning out of the whole edifice by putting schools into bands of ‘similar schools’ and ranking them into quintiles on the Data Dashboard.  Here schools with broadly similar 5A*-C scores (a few % apart) can be ‘top quintile’ and ‘bottom quintile’.  Here, the plot has been utterly lost.  There is no reproducible, meaningful sense in which schools’ outcomes can be processed in this way and convey the quality of learning or the overall educational experience.

A value added score of 986.7 or 1016.3:  In science, we teach students about measurement: accuracy, precision, resolution, reproducibility and so on.  It is standard practice to evaluate errors in measurement and to take care not to over-state the precision of a final result relative to the size of the errors.  For example, if your stop-clock only measures in seconds, you can’t say the time for a feather to drop is 8.63 seconds.  If you measure 100 drops, the calculator may tell you the average is 8.63 seconds, but your apparatus is not up to the job of giving you that level of precision.  If reaction time suggests an error of 1 second, the best you could hope for is an answer of, say, 9 seconds +/- 1 second.  But do the DfE and OfSTED understand this?  No, they do not.  The VA algorithm is deeply flawed.  School A: VA = 995.2 +/- 11.6.  School B: VA = 1002.1 +/- 13.7 (not uncommon).  We are expected to believe School B adds more value than School A, but the errors suggest we cannot make that claim.  Data garbage presented as truth.
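
For anyone who wants to check the arithmetic, here is a rough sketch using the figures above (treating the published ± figures as comparable confidence intervals, which is itself a simplification): the gap between the two schools is far smaller than the combined uncertainty.

```python
import math

def defensible_difference(va_a, err_a, va_b, err_b):
    """Crude check, treating the published +/- figures as comparable
    confidence intervals: is the gap between the two VA scores bigger
    than the combined uncertainty of the two measurements?"""
    gap = abs(va_a - va_b)
    combined = math.sqrt(err_a ** 2 + err_b ** 2)
    return gap > combined, gap, combined

# The figures from the post: School A 995.2 +/- 11.6, School B 1002.1 +/- 13.7.
different, gap, combined = defensible_difference(995.2, 11.6, 1002.1, 13.7)
print(f"gap = {gap:.1f}, combined uncertainty = {combined:.1f}, "
      f"defensible difference: {different}")
# gap = 6.9, combined uncertainty ~ 18.0: the ranking tells us nothing.
```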

Effect sizes of 0.29, 0.65 and 0.84:  My final bit of Data Delusion is the new and growing search for reproducible and reliable educational outcomes from research.  Hattie and Petty, amongst others, have done work in this area and, for me, the outcomes are interesting.  Over many studies, the rank order of effect sizes (derived from standard deviation calculations) leads to a set of high-impact strategies that ring true for me, with my subjective bias and values.  I should be really pleased.  But, unfortunately, there is already an overwhelming tendency for people to take these figures at face value.  To begin with, the figures are averages.  If an effect size is 0.65, no single study may have yielded that outcome; the range may have been 0.1 to 1.2 as the contexts shifted and changed.  Then, the level of precision implied by the second decimal place gives the impression that a 0.65 effect is somehow meaningfully higher than a 0.62 effect – and I have heard people take the surface rank order as gospel truth.  Of course, these are historical, retrospective averages.  They tell you nothing about what might happen in any specific context beyond this: some initiatives, on average, in the past, have been shown statistically, in the particular tests that were done, to exhibit this general pattern.  It might, therefore, be worth looking at these strategies to see if they also work well in your context.
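
A quick sketch with invented study results shows how this works: the average lands near 0.65 even though no individual study found that value, and the spread dwarfs the second decimal place.

```python
import statistics

# Invented effect sizes from ten hypothetical studies of the 'same' strategy
# in different contexts. The mean comes out near 0.65, but no single study
# found that value, and the spread dwarfs the second decimal place.
study_effects = [0.10, 0.25, 0.40, 0.55, 0.60, 0.70, 0.80, 0.95, 1.05, 1.20]

mean_effect = statistics.mean(study_effects)
spread = statistics.stdev(study_effects)

print(f"mean effect size = {mean_effect:.2f}")                     # 0.66
print(f"range {min(study_effects)}-{max(study_effects)}, sd = {spread:.2f}")
# Ranking a 0.65 above a 0.62 treats noise as signal.
```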

As I have shared in this post, what Hattie says about homework is complex.  And yet even intelligent people will tell me that 0.29 is a low effect size and therefore homework is a bad strategy.  It makes me weep…

What we need is an intelligent view of assessment that takes account of the distortions inherent in any measurement process; that is capable of embracing the idea of ‘error’; and that does more to link our assessments to the original learning.  Teachers, leaders, inspectors and politicians need to avoid placing high value on data in ways that cannot be sustained.  Learning is fuzzy; it is complex; let’s embrace that and not reduce it to something where all meaning has been lost.  It is like dropping a bag of marbles.  Even when we know, in physics terms, all the laws that determine the motion of a dropped marble, there are so many variables at play that we cannot predict how the marbles will fall and roll.  We can see a pattern, we can look at limits, but we can’t describe the detail.  Similarly, if we look at the final resting places of a bag of dropped marbles, we can’t extrapolate backwards to know exactly how they got there.  If this is true for a simple bag of marbles, for learning it is even more complex.  Let’s recognise that.

To finish, an unrepresentative, anecdotal cautionary note: another dodgy data dimension is the process of judging lessons and schools through OfSTED inspections and lesson observations.  I know an inspector who is quite happy to tell teachers their lesson is Good (not Outstanding) because of the quality of learning he observes and the levels of progress made.  This man is a creationist; he thinks living things were placed on Earth by a higher being and says ‘I don’t agree with all that evolution stuff’.  Never mind the evidence.  If he comes to your school, you’re in serious trouble!

Comments

  1. This is one of those posts where I’m with you… I’m with you… I’m with you… Now you’ve lost me – more down to my brain not functioning than anything to do with the post, but as you say it is complex.

    For me the mess of data in education is made worse by how badly we deal with things that we could actually control. We can’t completely control student performance and yet schools are held accountable for just that. Owen Nelton explains that rather eloquently here: http://matheminutes.blogspot.co.uk/2012/11/the-teachers-dilemma.html

    We could deal with the accountability structures by abolishing league tables which only cause gaming. But we don’t.

    And as for your Ofsted anecdote, I’ll take your Creationist Inspector and raise you a Welsh one: http://frogphilp.com/blog/?p=1224

    • Great story. That’s how these things should be.. the inspectors taking an intelligent view of things. I’ve got a blog in the pipeline about the accountability processes I’d like to see.. basically we need people who know schools well.

  2. All of these criticisms are not only well-founded; they are broad enough to apply not just to education. Averages do remove a lot of the detail found in sets of data. Standard errors are required whenever differences in performance are discussed. Quantification of qualitative properties is perfectly impossible.

    That being said, these criticisms have no merit in public policy, because they disallow any analysis of teaching. The only way to satisfy these complaints is to not analyze the policy (in this case, education) at all. That alternative is far worse than just admitting the issues but still presenting the analysis.

    Using your value-added score example, if one does find a statistically significant difference (which they will in many cases), then that is actionable evidence. The other criticisms are valid only as caveats, but do not serve to falsify the evidence that one school is performing significantly worse.

    • Thanks Syed. What I am after is a subtle, intelligent and sophisticated use of data. Some schools/teachers/learners are better than others – that is uncontroversial; let’s accept that this is true. The information we are presented with needs to reflect the full range of outputs, with errors and limits to significance fully laid out. There are solutions to grade boundary cliffs, to nonsense standard NC levels and to value added measures… but they require detail. If we’re going to simplify ‘so people can understand’, or for some other reason, we need to be very careful: distortions abound. How do you know a school is worse than another – that the learning is worse? In truth, we don’t really know exactly; we have some clues and those have limits.

  3. I never have understood why we don’t just publish UMS scores instead of converting them into grades. I know it won’t solve all the other problems about unreliable data but it is at least one relatively simple quick fix.

    • Yup. Agree. The silliest part is when UMS goes to grades and then grades are given a different points score eg A = 52. B = 46… So neighbouring UMS scores at the boundary become separated by 6 points.. nuts!

  4. Agree with everything you’ve said about statistical data here, Tom, but there’s an even higher, more insidious level of rubbish data we have to contend with that goes “40% of all schools in UK fail our children”. This is the crude manipulation of spurious data by the media, which then use it as a political lobbying tool for “more rigour” etc., which politicians respond to even more quickly than the babble of data you’ve outlined above. And does any of it really show you what our students can do in real life, or who they are as people, or what potential they have to change our society?
    Now that’s the sort of data I’d be interested in seeing. If only it were calculable…

  5. You articulate brilliantly why data is delusional. What is even more dangerous is that the people who ought to listen to this won’t, because it is inconvenient… data gives a veneer of simplicity and convenience to a disordered and complex world. It saves time and (in the current climate) a lot of money.

    • Absolutely.. it is a kind of unspeakable truth! We’re all so deeply conditioned to accept it.. it will take a massive shift to move to a more organic, nuanced notion of attainment.

  6. Thoroughly enjoyable read. From the general message to the “more accurate (Or should that be precise?) if it’s got a decimal point”

    Every member of SLT in every school should understand this.

  7. I think you are spot on with this. National Curriculum sublevels for me exemplify the mess we have got in with crude measurement and accountability, and for whom? Politicians who need either some way justifying policy or a stick with which to bash opponents. And once all this nonsense is aggregated we have a meaningless set of numbers that we pretend tells us something about the difference we are making.

    School leaders feel the pressure most acutely and often pass that pressure on down the line. So we have collections three times a year (or more frequently) that must show progress towards a ‘target’ based on last year’s national cohort. We confuse the micro with the macro.

    More insidious than that is the culture and climate of measurement and judgement it creates. We constantly feel measured and judged along with the students. This distracts from teaching, feedback and learning, which should be where most of our energy is spent. And as you say, ‘learning is fuzzy’, so it can’t be captured in a box-and-whisker plot.

    Of course we need to measure progress and I believe that schools should be accountable to the local communities which they serve, but the current data does not provide that. Patterns emerging from large sets of data can help us in seeing groups we are serving well and groups we are letting down: if enough marbles repeatedly drop in the same way then that is of interest, that is something we can observe. But we need to be very careful about what we measure to explain that observation.

    Like

  8. Another thought provoking blog Tom. While looking at educational outcomes issues as an independent knowledge broker, I’m also learning from how my daughter is graded by her secondary school across subjects using NC sub-levels. As an informed parent what I appreciate about this information is that she has been set targets at the start of the year and I get an idea of how she is progressing against them, and how this compares with the rest of the cohort. That is probably enough for triangulation of attainment purposes. But what we both look at closely is the effort column. This uses a very simple grade mechanism, but there is no purpose in analysing it or collating it – which suits me fine as it is her teachers’ judgement of her personal qualities in particular subjects – if I get a genuine sense that this is being applied unfairly then I will of course investigate further. Finally, the school makes clear that you can’t compare NC levels across subjects, which should probably be written in bold capitals so that all parents take note. I worry that the proposed secondary accountability system of (progression in) grade point averages across 8 capped subjects, which sounds great for removing the C/D focus, may cause confusion amongst most parents who won’t understand the nuances between attainment in English, Maths, 3 other E-Bac subjects (the sciences could be fun!), and 3 other non E-Bac subjects. I look forward to your next post about this.

    Like

  9. You’re right about the “value added score of 986.7 or 1016.3.” issue, and people who want to know a bit more should read http://www.ofsted.gov.uk/resources/using-data-improving-schools where David Jesson explains exactly this issue on behalf of Ofsted. However, if you’re below 1000 every year for five years in a row, the maths is a bit different. It’s not dead easy to calculate but the chances that there is a genuine difference rather than just normal variation are much higher when it’s repeated year after year.

    Like

    • Thanks for the comment. I agree that all relative measures become more meaningful if sustained over time. However, fundamentally, the VA scale is highly artificial. I’m not even sure that the scale is consistent year on year; it certainly isn’t linear. Eg 1000, 1010, 1020 are not equidistant in a meaningful sense. At least DFE gives the confidence limits; perhaps they should include these for grades too!? That’s where the greatest uncertainties lie.

      Like

      • [2nd attempt at posting – the editor made garbage of some of my punctuation… sorry]

        Hi – coming a bit late to the party…

        Thanks for the blog in general, and this post in particular. You’re highlighting limitations of this field of study that have gradually been dawning on me for a while.

        I disagree with part of your conclusion here:
        > School A: VA = 995.2 +/- 11.6. School B: VA = 1002.1 +/- 13.7 (not uncommon). We are expected to believe School B adds more value than School A… But the errors suggest we cannot make that claim.

        Surely there should be some celebration here that an estimate of error has been published for this? So no, we are not expected to believe that School B adds more value than School A. Arguably, that would have been the conclusion if confidence intervals had not been published – but they were. Of course, the value-added itself still has further problems, but I’m not sure you’ve picked a particularly valuable example to make your point there. What it highlights is that lots of people reading those stats might choose to ignore the confidence intervals – that doesn’t make the stats themselves invalid, just those readers’ statistical awareness.

        Like

      • Hi Jon. I agree – it is good that the confidence intervals are shown. However, the degree of accuracy suggested by the decimal place figures is dubious – untenable. The BBC tables allow sorting by VA figures – without the conf intervals.

        Like

  10. […] We know giving a number or grade when marking negates any comments given.  We have ridiculous situations where a student is graded 5a for one piece of work and then told they are a 5c six weeks later after another. You can guarantee that any observer in your class will ask the students for the level they think they are currently working at. And when asked what they need to do to progress, the student had better give an answer based on progression through the sub-levels. @headguruteacher goes into more detail about the mess we are in with data with his post The Data Delusion. […]

    Like

  11. As someone hoping to step aboard the headship ship sometime soon, this is an interesting post. It’s interesting anyway, but particularly so for me right now. We’ve just interviewed for a new head in our current school. I sat in on the presentations of all candidates – given the title ‘the vision for ***’ – and the one who most impressed our LA advisers and ultimately went on to secure the job? The one who gave the simple answer ‘to get the school to outstanding’. And how was he to do this? Through use of data and quality of T&L. After this process, I talked to an LA adviser about an upcoming presentation – they said it’s all about the data; everything you mentioned in this post, all its negatives, irrelevancies and confusions, THAT’S what they want to hear. I’m not sure where that leaves those of us who agree with you but need to ‘collude’ in order to advance? Perhaps I won’t get on board after all…

    Like

    • Sadly, that is the reality of appointments and I have suffered in a similar way. All parties repeat the data mantra because it is the only tangible thing they have. However, it is possible to use data sensibly, in perspective. So, it seems the thing to do is to take the data talk seriously so that people know you understand it… but, in post, to make sure it really isn’t given more weight than it warrants.

      Like

  12. Tom, is there any way I could give you a quick ring this week? I have a question regarding your recent blogs – it relates to an interview which is coming up very soon…! I’m at work this week on 01603 610993, if you are around. I would be happy to phone you back, rather than run up your phone bill! If not, no worries. I shall now print off your blog and staple it to my notice board… Fran

    Like

  13. Excellent stuff. Far too much reliance is placed on dubious data, and far too little attention paid to the inaccuracies and flaws in the assessment processes.

    One of the key pieces of learning I took from my PGCE assessment module was that the year I took A level physics, analysis showed that a candidate awarded a C grade by my exam board had an equal statistical probability of getting a B or a D, given the tightness of the grade boundaries and the marking/moderation error. In the days when a B got you a place in university Physics courses and a D didn’t. The #GCSEfiasco of 2011 helped reinforce this learning.

    Roll on the day when all exam results are required to be published as a numerical mark WITH the +/- standard error. And someone works out a way to accurately assess the reliability of those Ofsted grades 🙂 Then people may become a little more circumspect.

    Like

  14. Fascinating and troubling…

    My question is about how we lead in this context. How do you message this with your SLT and staff more widely? How do you reconcile the frustration and doubt with the need to “get on with it” and work within the deeply flawed system? That’s the part I’m stuck on… you can dispense with lesson gradings easily enough, but the stuff “higher up the chain” than that? We don’t have a choice to opt out.
