School inspections add value to data-driven school performance metrics by sending an experienced educator to collect first-hand evidence from inside a school. The human element of an Ofsted visit is a feature, not a bug.
But each inspector comes with their own unique set of experiences and priorities. This can lead to inconsistency. Two inspectors might reach different conclusions about the same school.
Given that perfect reliability is not desirable, how much reliability should we expect?
The American Educational Research Association argues that the higher the stakes of any assessment, the more reliable it should be. Big decisions require reliable judgements.
It is well known that Ofsted ‘Inadequate’ judgements can lead to school closures or heads losing their jobs. So when it comes to the lowest Ofsted judgements, we should expect good reliability.
Christian Bokhove, John Jerrim and I have just released new Nuffield-funded research comparing the judgements reached by 1,376 different inspectors across 35,751 schools between 2012 and 2019.
We found that primary schools assigned a female lead inspector are around one-third more likely to receive an ‘Inadequate’ judgement. Just under 6 per cent of judgements reached by female inspectors were ‘Inadequate’, versus 4.5 per cent of those reached by male inspectors.
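For readers who want to check the arithmetic, the "around one-third" figure follows from the two percentages quoted above. The sketch below is only illustrative: it rounds "just under 6 per cent" up to exactly 6 per cent, and the function and variable names are made up for this example rather than taken from the research.

```python
def relative_increase(rate_a: float, rate_b: float) -> float:
    """Return how much larger rate_a is than rate_b, as a fraction of rate_b."""
    return rate_a / rate_b - 1


# Assumption: "just under 6 per cent" is rounded up to exactly 6 per cent here.
female_rate = 0.06
male_rate = 0.045

ratio = female_rate / male_rate
increase = relative_increase(female_rate, male_rate)
gap_points = (female_rate - male_rate) * 100

print(f"Ratio of rates: {ratio:.2f}")
print(f"Relative increase: {increase:.0%}")
print(f"Absolute gap: {gap_points:.1f} percentage points")
```

The relative increase of roughly a third sits alongside an absolute gap of about one and a half percentage points. Both describe the same difference.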
Maybe female inspectors tend to get sent to weaker schools? But we found that this pattern held even when we compared male and female inspectors sent to inspect schools with the same prior Ofsted inspection rating, exam results, levels of pupil absences, pupil intake, and in the same region of the country.
Of course, we can’t definitively establish that there were no differences between the schools to which male and female lead inspectors were assigned. Maybe there were subtle differences – visible to the inspectors, but not in our data.
The only way to definitively establish the reliability of Ofsted inspections is to send two Ofsted inspectors to the same school, and check whether they agree. Indeed, you may remember that Ofsted did just such a study back in 2016 and found that the two inspectors tended to agree.
But this research had some important limitations. Crucially, the inspected schools were all previously rated ‘Good’, meaning they were subject to a short inspection in which the presumption was that they remained ‘Good’ unless proven otherwise. The inspections were also conducted by more senior inspectors, known as HMIs.
At the time, Amanda Spielman described this study as a “first step” and said that Ofsted should “routinely be looking at issues of consistency and reliability”. Ofsted has conducted a range of research since. However, there have been no more of these gold-standard two-inspector-one-school studies.
Crucially, there has been no research on the critical ‘Inadequate’ judgements. These are big decisions, but we do not have any evidence to suggest that they are reliable. Indeed, our new research provides some evidence to suggest they may not be.
Spielman’s term as Chief Inspector comes to an end in January 2024. And current polling suggests the government may lose power in the general election soon after. This creates a window of opportunity for modernising Ofsted. But what should be done?
Labour has recently dropped its Corbyn-era policy of abolishing Ofsted, promising instead to reform the inspectorate and focus it more directly on school improvement. Retaining Ofsted will likely be popular with parents. But Bridget Phillipson was heckled by teachers when she announced the plan at a union conference this week.
I would advise the shadow secretary of state to announce a series of new Ofsted reliability studies. These should use the gold-standard two-inspector-one-school methodology. And there should be four studies, focusing on schools in each of the four categories.
This would likely be popular with teachers, who demand to know whether the methods by which they are held to account are reliable. It should also be popular with parents, who would learn how much weight to place on judgements.
Importantly, the results would also provide the information policymakers need to make an informed decision about whether we have struck the right balance between the consequences of inspections and their reliability.
“Big decisions require reliable judgements.”
The same is true for GCSE, AS and A level grades, for which reliability is even more important: being awarded a wrong grade can be life-changing. That is what happened in August 2022 to about 23,000 students who received certificates showing grade 3, a fail, when they would have been awarded grade 4, a pass, had a senior examiner marked their scripts.
Unreliable grades do great damage, as discussed in FE Week a few days ago: https://feweek.co.uk/gcse-re-sits-wrong-grades-drain-students-and-resources/