It’s very welcome news that Amanda Spielman is going to support an Ofsted ‘research push’ as reported here in Schools Week. The publication of results of an initial trial is a step in the right direction. In my ‘Gimme some truth’ blog at the end of 2014, I complained about the lack of any validity trials so it’s great to read that Amanda Spielman is saying “we should be looking at the validity of inspection: is inspection succeeding in measuring what it is intended to measure?” – quoted in the Schools Week article. Validity and reliability are clearly different but inter-related; both need to be explored.
At the Headteachers’ Roundtable event in February, the findings from the now-published trial were discussed on a panel including Dr Becky Allen from Education DataLab and Stephen Tierney; Becky Allen has written an excellent article, “Ofsted inspections are unreliable by design”, about the numerous innate biases that inspectors are not immune from. The Ofsted study involved two people conducting parallel inspections of Good primary schools. Becky Allen’s challenge to the finding that the inspectors agreed in 22 out of 24 of the trials was that she’d have been surprised if they hadn’t. The fact that both inspectors had the same data and visited on the same day, knowing they were part of a trial, means that the anchor bias would have been significant; it would have been very hard for an inspector to deviate from what the data suggested. We heard that ‘by lunchtime’, in most cases, the inspectors had agreed with each other.
‘By lunchtime’!? I know primary schools are not huge, but if you look at how many lines of enquiry there are in the Ofsted framework, the idea that you can judge a school with any degree of accuracy in that time is astonishing. Isn’t this rather precarious finger-in-the-air territory? Faced with what is essentially an impossible task, inspectors’ judgements must all too often amount to guesswork; they have to extrapolate from very thin sampling processes to make overall judgements. In that context, the data-weighting will be very significant as an anchor bias, as will several other subjective first-impressions biases. So that Ofsted trial, whilst a welcome step, is as much a confirmation of how strongly the data anchors a short inspection as it is evidence of reliability. And if that data was wrong? Well… that possibility wasn’t being tested.
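To illustrate why high raw agreement can be less impressive than it sounds, here is a back-of-the-envelope sketch. The 90% ‘confirm the anchor grade’ probability below is an invented assumption for illustration, not an Ofsted figure; only the 22-out-of-24 figure comes from the trial.

```python
# Illustrative only: p_confirm is an assumption, not Ofsted data. If each
# inspector independently confirms the existing 'Good' grade ~90% of the
# time, high raw agreement is expected even with no shared insight.
p_confirm = 0.9               # assumed chance an inspector confirms the anchor grade
p_change = 1 - p_confirm      # chance they deviate from it

# Agreement expected if the two judgements were statistically independent
p_chance = p_confirm ** 2 + p_change ** 2

# Observed agreement in the trial: 22 of 24 paired inspections
p_observed = 22 / 24

# Cohen's kappa: agreement beyond chance (0 = pure chance, 1 = perfect)
kappa = (p_observed - p_chance) / (1 - p_chance)

print(f"chance agreement:   {p_chance:.2f}")    # 0.82
print(f"observed agreement: {p_observed:.2f}")  # 0.92
print(f"kappa:              {kappa:.2f}")       # 0.54
```

On these assumed numbers, agreement in roughly 20 of 24 pairs would be expected by chance alone, so 22 of 24 is only modest evidence of genuinely independent, convergent judgement.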
So, in the spirit of positive engagement, I have some suggestions for trials that would have more hope of revealing some important truths about reliability and, possibly, validity:
Explore data bias effect:
Are inspectors looking at evidence in a truly open-minded fashion or are all the human bias factors dominant? How significant is this effect?
Suggestion: Repeat the one-day primary short-inspection trial, but with much tighter control of variables, having the same schools inspected by multiple inspectors:
- Some given correct data for school
- Some given inflated data for school
- Some given depressed data for school
- Some given no data for school
Repeat for teams in a secondary school.
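A trial along these lines would need inspectors randomly and evenly allocated across the data conditions. A minimal sketch of how that allocation might be done, assuming eight inspectors per school (the names, counts, and condition labels are illustrative, not a real protocol):

```python
import random

# The four data conditions proposed above; counts and labels are
# illustrative assumptions for this sketch.
CONDITIONS = ["correct data", "inflated data", "depressed data", "no data"]

def allocate(inspectors, seed=0):
    """Randomly assign each inspector one data condition, balanced across conditions."""
    rng = random.Random(seed)  # fixed seed so the allocation is reproducible/auditable
    shuffled = inspectors[:]
    rng.shuffle(shuffled)
    # Cycle through the conditions so each is used equally often
    return {name: CONDITIONS[i % len(CONDITIONS)] for i, name in enumerate(shuffled)}

allocation = allocate([f"inspector_{n}" for n in range(8)])
for name, condition in sorted(allocation.items()):
    print(name, "->", condition)
```

With eight inspectors and four conditions, each condition is seen by exactly two inspectors, so judgements can be compared both within and across conditions.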
Explore day variability effect:
Is one day enough to capture the reality of a school – is the evidence base similar enough from one day to the next for judgements to be secure as a general evaluation of a school?
Suggestion: Repeat the same trial as above, but with inspectors visiting the same schools on different days, with and without prior data.
Explore Book Scrutiny element:
Can inspectors reasonably be expected to make valid judgements about standards based on the scale and nature of work sampling that is undertaken in a typical inspection process?
Suggestion: Isolate this process for a focused trial at primary and secondary with multiple inspectors engaging in book scrutiny processes in the same school using the same ad hoc sampling methods used in real inspections – browsing through a few books during lesson drop-ins and the occasional study of a full set of books.
- Some inspectors with prior knowledge of whole school data; some without.
- Some inspectors with knowledge of progress data for the specific students whose books are being scrutinised; some without.
- Engineer process to compare judgements when inspectors see the same books and then totally different books.
Note: this all makes an assumption that an inspection is looking to judge typicality. If the brief is to judge what you see and extrapolate, these trials are redundant. If one set of poor books is sufficient to suggest low standards, then you only need to find one. However, I don’t think that’s what it says in the framework.
Explore Lesson Observation Element:
Can inspectors really judge the quality of teaching, learning and assessment using the processes that are typical in inspections? How many lessons should be seen and in what configuration to be confident that any judgements reached give a fair picture of standards across the school? Can inspectors make meaningful judgements in the absence of any triangulating data for the classes they observe? Should any data be seen before or after the observation to maximise reliability?
Suggestion: Isolate this process for a focused trial at primary and secondary with inspectors undertaking a typical range of 7-8 minute lesson observations and walk-throughs:
- Some inspectors with knowledge of school data; some without
- Some inspectors knowing data for the students in the class; some not
- Some inspectors knowing historical data for the department; some not
- Some inspectors seeing the same teachers; some seeing entirely different teachers
- Some inspectors seeing different numbers of lessons
- Some inspectors seeing only lower attaining teaching groups; some seeing a mix of ability groupings.
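On the ‘how many lessons?’ question, a toy simulation shows why sample size matters so much. Every number below is invented purely for illustration — the point is only that the error in an estimate built from a handful of brief observations shrinks slowly as more lessons are seen:

```python
import random
import statistics

# Toy model: each lesson in a school has an underlying quality score and an
# inspection samples only a handful. All figures here are invented.
rng = random.Random(42)
true_scores = [rng.gauss(2.5, 0.7) for _ in range(200)]  # "all lessons" in a school
true_mean = statistics.mean(true_scores)

for sample_size in (5, 15, 40):
    # Spread of the sample mean across many hypothetical inspections
    errors = []
    for _ in range(2000):
        sample = rng.sample(true_scores, sample_size)
        errors.append(abs(statistics.mean(sample) - true_mean))
    print(f"{sample_size:2d} lessons: typical error ~ {statistics.mean(errors):.2f}")
```

The typical error falls only in rough proportion to the square root of the number of lessons seen, so going from 5 observations to 40 does not make the estimate eight times better — which is exactly why the configuration of observations deserves a trial of its own.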
Group Dynamics Effect:
In much the same way that shadow jury trials have been conducted to explore how verdicts are arrived at, it would be possible to evaluate the power of a Lead Inspector to shape the outcome of an inspection. Is this a key factor, and do inspectors generally reach decisions they are satisfied with?
Suggestion: conduct some inspection trials with, say, eight inspectors and two lead inspectors. Randomly divide the inspectors into two teams: one led by an inspector pushing for a lower judgement, the other led by an inspector pushing for a higher judgement. Interview each person before the group meets, keeping everyone in isolation, then run the final meetings to arrive at overall judgements. What happens?
Obviously there are ethical and logistical issues to be addressed but, if Ofsted is going to be serious about this, then controlled trials of this kind will need to happen. It’s still the case that absolutely no trials with any degree of control have taken place in a secondary school. Asking people how they feel about it afterwards in the post-inspection surveys is about all we have.
Given the high-stakes nature of the judgements, and the sheer number of sub-judgements that are formed in a very short time, it’s vital that the process is subjected to some high-intensity scrutiny. It could be that inspectors’ judgements are pretty reliable despite the scale of the task they face. That would be good news for any school facing inspection. But what if it all proves to be a house of bias-laden cards? That would suggest inspections need to change in a fairly radical manner. At the moment we simply don’t know. It’s no use asserting confidence in reliability when it has never been tested. So – it’s great news that this is now on the agenda. Let’s get some serious trials underway, conducted by external, independent people who know how these things are done.