Ofsted Reliability Trials. Some suggestions.

Image: SciChem International (CASE equipment).  Testing variables in the resonant frequency of pipes.

(UPDATE at the Bottom…)

It’s very welcome news that Amanda Spielman is going to support an Ofsted ‘research push’ as reported here in Schools Week. The publication of results of an initial trial is a step in the right direction. In my ‘Gimme some truth’ blog at the end of 2014, I complained about the lack of any validity trials, so it’s great to read that Amanda Spielman is saying “we should be looking at the validity of inspection: is inspection succeeding in measuring what it is intended to measure?” – quoted in the Schools Week article. Validity and reliability are clearly different but interrelated; both need to be explored.

At the Headteachers’ Roundtable event in February, the findings from the now-published trial were discussed on a panel including Dr Becky Allen from Education DataLab and Stephen Tierney. Becky Allen has written an excellent article, “Ofsted inspections are unreliable by design”, about the numerous innate biases that inspectors are not immune to. The Ofsted study involved two people conducting parallel inspections of Good primary schools. Becky Allen’s response to the finding that the inspectors agreed in 22 out of 24 of the trials was that she’d have been surprised if they hadn’t. Both inspectors had the same data and visited on the same day, knowing they were part of a trial, so the anchor bias would have been significant; it would have been very hard for an inspector to deviate from what the data suggested. We heard that ‘by lunchtime’, in most cases, the inspectors had agreed with each other.
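To put a rough number on why high agreement on its own is unsurprising, here is a minimal Python sketch. It assumes, purely for illustration, that each inspector independently says ‘still Good’ with some probability `p_good`; those base rates are invented, not taken from the Ofsted study, and the function names are my own. It only shows the agreement you would expect by chance, before any anchoring on shared data is even considered.

```python
from fractions import Fraction


def observed_agreement(agreed: int, total: int) -> Fraction:
    """Share of paired inspections in which both inspectors agreed."""
    return Fraction(agreed, total)


def chance_agreement(p_good: float) -> float:
    """Agreement expected if two inspectors judged independently,
    each saying 'still Good' with probability p_good (hypothetical)."""
    return p_good ** 2 + (1 - p_good) ** 2


# 22 agreements out of 24 paired inspections, as reported for the trial.
observed = observed_agreement(22, 24)
print(f"Observed agreement: {float(observed):.1%}")

# Hypothetical base rates: NOT figures from the study.
for p_good in (0.5, 0.8, 0.9):
    print(f"p_good={p_good}: agreement by chance {chance_agreement(p_good):.0%}")
```

The closer the base rate gets to ‘nearly every school stays Good’, the less a high raw agreement figure tells you about reliability.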

‘By lunchtime’!? I know primary schools are not huge, but if you look at how many lines of enquiry there are in the Ofsted framework, the idea that you can judge a school with any degree of accuracy in that time is astonishing. Isn’t this rather precarious finger-in-the-air territory? Faced with what is essentially an impossible task, inspectors’ judgements must all too often be guesswork; they have to extrapolate from very thin sampling processes to make overall judgements. In that context, the weight given to data will be very significant as an anchor bias, as will several other subjective first-impression biases. So that Ofsted trial, whilst being a welcome step, is as much a confirmation that the anchor bias of data in a short inspection is a strong factor as it is evidence of reliability. And if that data was wrong? Well… that possibility wasn’t being tested.

So, in the spirit of positive engagement, I have some suggestions for some trials that could be conducted that have more hope of revealing some important truths about reliability and, possibly, validity:

Explore data bias effect:

Are inspectors looking at evidence in a truly open-minded fashion or are all the human bias factors dominant?  How significant is this effect?

Suggestion: Repeat one-day primary short-inspection trial but with much greater control of variables with the same schools inspected by multiple inspectors:

  • Some given correct data for school
  • Some given inflated data for school
  • Some given depressed data for school
  • Some given no data for school

Repeat for teams in a secondary school.
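As a minimal sketch of how inspectors might be allocated to those four data conditions, the Python below shuffles a list of inspectors and deals them out in turn. The inspector labels, the group of eight and the function name are all hypothetical; a real trial would need a properly designed allocation, so treat this as an illustration of random assignment and nothing more.

```python
import random

# The four data conditions from the list above.
CONDITIONS = ["correct data", "inflated data", "depressed data", "no data"]


def assign_conditions(inspectors, seed=None):
    """Shuffle the inspectors, then deal them round-robin into the conditions."""
    rng = random.Random(seed)  # seeded so an allocation can be reproduced
    shuffled = list(inspectors)
    rng.shuffle(shuffled)
    allocation = {condition: [] for condition in CONDITIONS}
    for i, inspector in enumerate(shuffled):
        allocation[CONDITIONS[i % len(CONDITIONS)]].append(inspector)
    return allocation


# Hypothetical group of eight inspectors visiting the same school.
inspectors = [f"Inspector {n}" for n in range(1, 9)]
allocation = assign_conditions(inspectors, seed=2017)
for condition, group in allocation.items():
    print(f"{condition}: {group}")
```

Random allocation matters here because it stops anyone, consciously or not, steering particular inspectors towards particular versions of the data.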

Explore day variability effect: 

Is one day enough to capture the reality of a school – is the evidence base similar enough from one day to the next for judgements to be secure as a general evaluation of a school?

Suggestion: Repeat same trial as above but with inspectors going in on different days to the same schools with and without prior data.

Explore Book Scrutiny element:  

Can inspectors reasonably be expected to make valid judgements about standards based on the scale and nature of work sampling that is undertaken in a typical inspection process?

Suggestion: Isolate this process for a focused trial at primary and secondary with multiple inspectors engaging in book scrutiny processes in the same school using the same ad hoc sampling methods used in real inspections – browsing through a few books during lesson drop-ins and the occasional study of a full set of books.

  • Some inspectors with prior knowledge of whole school data; some without.
  • Some inspectors with knowledge of progress data for the specific students whose books are being scrutinised; some without.
  • Engineer process to compare judgements when inspectors see the same books and then totally different books.

Note: this all makes an assumption that an inspection is looking to judge typicality. If the brief is to judge what you see and extrapolate, these trials are redundant.  If one set of poor books is sufficient to suggest low standards, then you only need to find one.  However, I don’t think that’s what it says in the framework. 

Explore Lesson Observation Element: 

Can inspectors really judge the quality of teaching, learning and assessment using the processes that are typical in inspections? How many lessons should be seen and in what configuration to be confident that any judgements reached give a fair picture of standards across the school?  Can inspectors make meaningful judgements in the absence of any triangulating data for the classes they observe?   Should any data be seen before or after the observation to maximise reliability?

Suggestion: Isolate this process for a focused trial at primary and secondary with inspectors undertaking a typical range of 7-8 minute lesson observations and walk-throughs:

  • Some inspectors with knowledge of school data; some without
  • Some inspectors knowing data for the students in the class; some not
  • Some inspectors knowing historical data for the department; some not
  • Some inspectors seeing the same teachers; some seeing entirely different teachers;
  • Some inspectors seeing different numbers of lessons
  • Some inspectors seeing only lower attaining teaching groups; some seeing a mix of ability groupings.
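To give a sense of scale, crossing just the first three either/or factors in that list already produces eight distinct conditions. The Python sketch below enumerates them; the factor labels and the function name are my own shorthand, and a real design would also have to handle the remaining factors (which teachers, how many lessons, which groups).

```python
from itertools import product

# Three yes/no knowledge factors taken from the list above (labels are shorthand).
FACTORS = {
    "knows school data": (True, False),
    "knows class data": (True, False),
    "knows department history": (True, False),
}


def design_cells(factors):
    """Return every combination of factor levels as a list of dicts."""
    names = list(factors)
    return [dict(zip(names, levels)) for levels in product(*factors.values())]


cells = design_cells(FACTORS)
print(f"{len(cells)} conditions")
for cell in cells:
    print(cell)
```

Each extra two-level factor doubles the number of conditions, so any trial would have to choose carefully which comparisons to prioritise.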


Group Dynamics Effect: 

In much the same way as shadow jury trials have been conducted to explore the dynamics of how verdicts are arrived at, it would be possible to evaluate the power of a Lead Inspector to shape the outcome of an inspection. Is this a key factor and do inspectors generally reach decisions they are satisfied with?

Suggestion: Conduct some inspection trials with, say, eight people and two lead inspectors. Randomly divide the inspectors into two teams: one led by an inspector pushing for a lower judgement, the other led by an inspector pushing for a higher judgement. Interview each person before the group meets, keeping everyone in isolation, then run the final meetings to arrive at overall judgements. What happens?


Obviously there are ethical and logistical issues to be addressed but, if Ofsted is going to be serious about this, then controlled trials of this kind will need to happen.  It’s still the case that absolutely no trials with any degree of control have taken place in a secondary school.  Asking people how they feel about it afterwards in the post-inspection surveys is about all we have.

Given the high-stakes nature of the judgements – and the sheer number of sub-judgements that are formed in a very short time – it’s vital that the process is subjected to some high-intensity scrutiny. It could be that inspectors’ judgements are pretty reliable despite the scale of the task they face. That would be good news for any school facing inspection. But what if it all proves to be a house of bias-laden cards? That would suggest inspections would need to change in a fairly radical manner. At the moment we literally don’t know. It’s no use simply asserting confidence in reliability when it has never been tested. So – it’s great news that this is now on the agenda. Let’s get some serious trials underway, conducted by external, independent people who know how these things are done.

September Update 2017. 

Fair play to Amanda Spielman  – she had read this blog and tackled some of my suggestions for a ‘magnificently ambitious research programme’ in her talk at ResearchEd 17 in Newham on 9th September.

Watch from 11.30 for the response to this blog.

It’s interesting.  The objections are based on costs and practicalities  –  not principles – and it would require setting up some fairly labour-intensive scenarios and contrived conditions to generate the comparisons needed to establish some general patterns around biases of different kinds.  I accept that these idealised experiments are unlikely to happen.  As a consequence, some schools and Heads will continue to pay the price for poor inspection processes where the biases come into play, however isolated these cases might be.  Hey-ho.

For sure the book scrutiny and lesson observation trials detailed above would actually be relatively easy to deliver on  – and certainly worth the cost, given the impact they have on the judgement-making process.  How many lesson observations, conducted in what way; how many books sampled with what methodology, allow inspectors to gain a sense of standards within a reasonable degree of accuracy?   The flick-through on-the-fly book scrutiny process is so poor – surely we can’t let it persist.  And as for scanning work and letting other inspectors join the process from a distance?  Surely the amount of work needed to reach a sound judgement would make this totally impractical.

It’s good to hear Amanda being very explicit that schools should not be grading individual lesson observations. That can’t be said enough. However, it is also a fact that in my last inspection, one inspector was telling people what lesson grades would have been ‘in old money’. This featured in our complaint but, due to the circular, Kafkaesque nature of the complaints process, where Ofsted is judge, witness and jury, there was no evidence of this so, officially, it did not happen! Our assertion that it did was not sufficient to count as being true. So – the habit of harking back to grading cuts both ways. I have also met Headteacher-inspectors who have given me the ‘you can just tell’ line about schools they’ve inspected. Professional judgement or pure bias? It seems very precarious to me. Out there across the country, some schools with their new Good logo will actually be much worse than other schools living with the burden of being RI. It’s unacceptable.

If Amanda’s alternative suggestions are open to full scrutiny we might get somewhere. For now, it remains the case that secondary school inspection processes are literally untested for reliability, and the consequences of coming out on the wrong side of the coin toss are still huge. Several Heads’ heads will roll this year on the back of that… So, while we can all agree that, yes, it’s all rather difficult and expensive… let’s not pretend that it’s all rosy; our inspection system still crushes people unjustly. Are you comfortable with that? I’m not. I’d like to hear Amanda acknowledging that side of this whole debate more often. The human cost is real and that really needs to change.


  2. This is a brilliant set of ideas. I’ve often wondered what would happen if inspectors had to inspect without being permitted to look at any data or know the previous grade, but this takes it further. Maybe schools could ask to be included in this trial. Some options might have to be just a trial and not an actual inspection, so the school would get a private unpublished report – or maybe several reports!
