Towards an Assessment Paradigm Shift

Despite my reservations about some of the big data measures that are used to judge schools, I am hopeful that our discourse is shifting the debate on assessment in a very positive direction.  If this continues it will represent an important paradigm shift with positive consequences for students’ learning and their overall school experience.

The Old Paradigm:  Macro Summative Attainment Tracking

This has reigned supreme for the last 10-15 years or possibly more.   The focus has been on trying to represent students’ attainment across all disciplines in terms of generic ladders of grades and levels supported by descriptors so that attainment and progress could be tracked.  The value given to data tracking has been the driving force – the idea that summative macro data is a requirement for driving up standards.

The elements of this paradigm – which is still absolutely dominant – include the following:

  • Confusing bell-curve ranking with absolute standards.  A C grade or a Level 4 never were or are indications of absolute standards; they only make sense in reference to a cohort. It’s never been true that, say, ‘explaining how a motor works’ is Grade B or ‘solving simultaneous equations’ is Grade 8.  If a student has a Grade 5 on their report, we know nothing about what they know.  The very worst examples of this are where schools use GCSE grades as a ladder – the horrible notion of ‘Working At’ grades that has now penetrated into KS3 in some schools.  In most subjects there is no meaningful sense in which you are at Grade 3, then a 4, then a 5; or a D, then a C, then a B.  Bell-curve markers do not work in that way; it’s nonsense – as the short sketch below illustrates.  (I’ve even seen examples of 3+ and 4- being used – as if they are definitely, definably distinct.)
  • The illusion of ‘progress’ as something that can be measured via reference to bell-curve grades or data points in general.  ‘Levels of progress’ was the worst example of this – with the absurd but pervasive misconception that a jump from a 3 to a 5 and a jump from a 4 to a 6 are broadly equal, without any reference to the content of what is being learned.  When we were forced to talk about students making ‘five sub-levels of progress’ I really thought we had lost our minds… it’s pure data idiocy.  This has now gone (Gove’s greatest legacy, in my view), but the very idea that ‘progress’ in learning has a measurable size, ludicrous as it is, is still widely held.
  • The target-grade culture: the idea that, by setting systematic attainment targets in the language of summative grades, students will learn more.  This is a deeply flawed notion.  I am on a C. My target is a B.  I must work harder to get a B.  This sequence is devoid of learning content and rarely translates into anything actionable for students, because ‘being on a C’ doesn’t mean enough at the level of what they know and can do beyond ‘I must work harder and learn more to reach a higher grade’ – which is universal at any level.
  • The attempt to use summative tests formatively.  This is very common, with students sitting endless mock exams and tackling ‘GCSE-style questions’ prematurely.  It’s akin to getting students to play a whole piano piece over and over before they’ve learned their scales – or asking someone to play a match before learning the basic passing skills.
  • Centralised data tracking machinery: Schools across the land have got spreadsheets like this where data is entered for the purposes of tracking.  You can stare at the numbers and grades all day long, but unless this feeds back into different actions in the classroom, no child learns anything more.

[Screenshot: a typical whole-school data-tracking spreadsheet]
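
To make that concrete, here is a minimal sketch of how purely norm-referenced grading behaves. The cohorts, cut points and grade labels are all invented for illustration – this is not any exam board’s actual procedure – but it shows why a grade encodes rank within a cohort rather than anything absolute about what a student knows.

```python
# A minimal sketch (invented cohorts and cut points, not any exam
# board's real procedure) of norm-referenced grading: grades are
# assigned by rank within a cohort, so the same raw performance
# earns different grades in different cohorts.
import numpy as np

rng = np.random.default_rng(seed=1)

def norm_referenced_grades(raw_scores, labels=("A", "B", "C", "D", "E")):
    """Label each score by the cohort-relative band it falls into.

    Cut scores are percentiles of the cohort's own distribution, so
    the 'standard' each grade represents moves with the cohort.
    """
    cuts = np.percentile(raw_scores, np.linspace(0, 100, len(labels) + 1)[1:-1])
    bands = np.digitize(raw_scores, cuts)  # 0 = lowest band, 4 = highest
    return [labels[::-1][band] for band in bands]

# The same raw score of 60 in two different (simulated) cohorts:
weak_cohort = rng.normal(45, 10, 1000)    # lower-attaining cohort
strong_cohort = rng.normal(70, 10, 1000)  # higher-attaining cohort

for name, cohort in [("weak", weak_cohort), ("strong", strong_cohort)]:
    grade = norm_referenced_grades(np.append(cohort, 60.0))[-1]
    print(f"Raw score 60 in the {name} cohort -> grade {grade}")
# Typical output: grade A in the weak cohort, grade E in the strong one,
# for identical performance - the grade encodes rank, not knowledge.
```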

One of the reasons I don’t like the mechanics of Progress 8 is because they reinforce the macro data-tracking behaviours of leaders at the expense of the micro learning-focused shifts that are more important… But that’s another story.

I enjoyed the response to this Twitter ‘thought experiment’. Try it:

The trouble with macro data-tracking is that the journey from the spreadsheet to the classroom is circuitous at best, and normally doesn’t happen at all.  All we are doing is generating data that might tell a story about relative student performance  – it might, at best, give a picture of where students are – but it does nothing to take them further.  Even here, the grades themselves do not actually tell us anything at all about what students know or can do; literally nothing.


Finally, a part of this paradigm has been our approach to marking.  For years now, marking has been largely a PR exercise where how the marking looks has been more important than what it achieves, with an appallingly low impact-to-workload ratio.  Most marking is wasted.  It doesn’t lead to students knowing more, understanding things better or producing better-quality work.  This is in part due to the eternal marking paradox: those students who need the most help are the least able to interpret the real meaning of marking in order to adjust their thinking or performance.  However, school control cultures – (insert reference to external accountability pressure if you want) – have demanded levels of marking – of red pen – that bear no relation to students’ progress in their learning.

Across the land, marking is peppered with things like B-, Good effort, your ending is rather vague, explain this in more detail – that kind of thing: comments that very many students cannot use to actually improve their work.  Then there is the use of www/ebi self-evaluation, where students write things like ‘I must avoid making silly mistakes’ or ‘I need to revise more’. These generic wish-statements cannot actually make a difference to what students know, understand or can do.  Again, the approach is based on the need to satisfy an external machine that requires compliance with a set of institutional expectations (here, that www/ebi is ‘a good thing’) rather than being focused on learning itself.

The New Paradigm: Authentic Learning-Focused Formative Assessment

Luckily, there is light. Things are changing, stirring, evolving. A new dawn beckons… (or is it a return to a previous dawn that faded years ago?)  There are several key influences that I think are helping to shift us towards a new paradigm:

  • Dylan Wiliam retains his high-profile influencer status and is continually reinforcing the importance of ‘responsive teaching’ – the true meaning of AfL – and the vital role of ‘minute by minute’ formative assessment, where teachers check for understanding, adjust their teaching and continually seek to deepen students’ understanding and knowledge. This remains at the heart of real assessment: it’s in the moment, with tight feedback loops leading to immediate actions focused on specific elements of learning.
  • Daisy Christodoulou’s Making Good Progress has exploded our fixed ideas about assessment practice, showing how problematic our use of summative testing is and how rare it is to see genuine formative assessment: high-frequency, low-stakes, narrowly focused testing using raw marks, owned only by a teacher and her/his students, feeding directly back into the teaching and learning process.  She has also promoted important ideas around comparative judgement as a truly reliable means of gauging relative standards – mainly (but not only) in relation to writing in English – free from the flaws of descriptors and rubrics that are so hard to use consistently.
  • Cognitive science is growing in its reach into our consciousness as a profession. The lessons from cognitive load theory – amongst others – encourage us to employ much more direct means of improving students’ knowledge: effective instructional methods, regular retrieval practice through knowledge reviews and low-stakes recall testing.  This then allows us to gauge how well students are doing in terms of how much they actually know about specific topics.  The rise of knowledge organisers and personal learning checklists is helping to frame this work: being more explicit about what students should know and then helping them to learn it. Here the assessment is purely in terms of what they know, in the absence of any proxy grading system.
  • Ron Berger’s An Ethic of Excellence and the fabulous, metaphor-packed Austin’s Butterfly are hugely influential, in my view.  Here we see that we need to define our butterflies.  We need to spell out or at least exemplify what excellence might look like and then devise iterative feedback processes that allow students to see the steps from where they are to where they could be in the detail of their learning goals.  Austin and his teacher do not need a graded ladder or tracking system to help him improve; it’s all about the detail of the work itself linked to a clear idea of what excellence might look like, modelled by an exemplar.
Austin’s Butterfly. The final draft was always within him. It just needed to find a way out.


  • Schools such as Michaela are showing that, if you think hard about what you do and are firmly focused on maximum impact on learning – not on answering to the external machine – you can engineer paradigm shifts around practices such as feedback and marking.  Ideas such as whole-class feedback instead of traditional book marking are catching on.  Jo Facer’s blog on this is superb; it’s a game-changer. Instead of slavishly marking books, we should be giving whole-class feedback that is prompt, immediately actioned, workload-efficient and effective in securing improvement. This is what matters – not the red pen, and NOT the ‘verbal feedback given’ stamp.  FFS.

The challenge with all these elements of the ‘new paradigm’ is that they do not produce neatly aligned datasets for leaders to scrutinise on their management information systems.  The devil is all in the detail.  It’s more precise and simultaneously more organic.  Instead of seeing that a student is on a 6 or scored 57% on a mock exam, you need to see the work in their books and the test scores on very specific topics – knowing and understanding that this does not have an associated level or grade on a bell curve.  This is authentic assessment, focused on learning, not on creating false codified meaning to facilitate comparisons outside the learning arena.

In this context, leaders would need to do much more to triangulate information so that they build up a picture of what is going on for any student without placing undue emphasis on any one part.


There are areas that I think we need to develop further still – or avoid:

Avoid the development of massive, unwieldy centralised statement banks: the horror of the ‘Can Do’ statement machine.  Once you’ve got a hundred tick-boxes, you’ve created something you can’t monitor closely enough to feed back into your teaching.  The statements themselves are often too vague (I can explain photosynthesis; I can add and multiply fractions; I can evaluate the use of metaphors in a poem) because it all depends on the context of those questions.  You need concrete examples.  Replacing an amorphous generic grade with a massively unwieldy list of vague statements is not an improvement.  The key is to keep formative information as close as possible to the site of the learning – in students’ books and folders, not on a teacher’s computer.

Let’s give a reprieve to Grades 1-3 and get them out of the box marked FAIL.  Until all the points on a bell curve – which forces 30% of students into the bottom 30% (shocking, I know!) – are seen for what they are, in neutral terms, we’re writing off thousands of children in an unacceptable fashion.

Let’s do more to share exemplar work nationally so that we can define standards at any given level against concrete examples.  This could be done through the content expected in subjects like maths and science, where difficulty can be specified, and through examples of work – in line with the comparative judgement process – in subjects like English and history.  The work done by NoMoreMarking in this area is a great example.  It’s the start of what could become a national database of standards, built on comparative judgement, that all could access and use.
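
To show the mechanics, here is a minimal sketch of the statistical idea underneath comparative judgement, using a Bradley–Terry model on invented judgement data. This is not NoMoreMarking’s actual engine – just the core mechanism: many pairwise ‘which is better?’ decisions aggregated into a single quality scale.

```python
# A minimal sketch of the statistics behind comparative judgement:
# judges make pairwise "which is better?" decisions and a simple
# Bradley-Terry model turns the wins into a single quality scale.
# The judgement data is invented; NoMoreMarking's real engine is
# more sophisticated than this.
from collections import defaultdict

# (winner, loser) pairs from hypothetical judging of scripts A-D
judgements = [("A", "B"), ("A", "C"), ("B", "C"), ("A", "D"),
              ("C", "D"), ("B", "D"), ("A", "C"), ("B", "A"), ("D", "C")]

scripts = sorted({s for pair in judgements for s in pair})
wins = defaultdict(int)  # wins[(i, j)] = number of times i beat j
for winner, loser in judgements:
    wins[(winner, loser)] += 1

strength = {s: 1.0 for s in scripts}  # Bradley-Terry quality parameters

for _ in range(100):  # Zermelo's fixed-point iteration for the MLE
    new = {}
    for i in scripts:
        total_wins = sum(wins[(i, j)] for j in scripts if j != i)
        denom = sum((wins[(i, j)] + wins[(j, i)]) / (strength[i] + strength[j])
                    for j in scripts if j != i)
        new[i] = total_wins / denom
    # rescale so strengths sum to len(scripts), for identifiability
    total = sum(new.values())
    strength = {s: v * len(scripts) / total for s, v in new.items()}

# Scripts ranked by estimated quality, with no rubric in sight
for s in sorted(scripts, key=strength.get, reverse=True):
    print(f"{s}: strength {strength[s]:.2f}")
```

The design appeal is that judges only ever make holistic relative decisions – no descriptors or rubrics to interpret – and reliability comes from aggregating many judgements across many judges.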

I do think the changes I’ve highlighted here would represent a paradigm shift if we could take the whole system on this journey. It feels like it’s still at an embryonic stage but at least we’ve made a start.


27 comments

  1. Of course all of this relates back to accountability systems. We live in a legalistic society where blame can be shifted easily between those in positions of responsibility. This needs to change to truly free up assessment.


  2. Interesting.

    My understanding is that “assessment” is a shortened version of “assessment of learning”. The learner learns something and we then just need to find out whether they have learned what was intended (and, for me, the things learned that were unintended).

    Assessment for me as a professional educator is simple. Assessment for me is an individual pursuit. Classes don’t learn; individuals learn.

    It is fairly easy to assess the current state of learning and to provide guidance for future learning, so why do people wish to make the thing much more difficult than it is? I can see that when we wish to amalgamate the inferences drawn from individual assessments to speculate about classes of 30, schools of 1,000 or countries of 10,000,000 the thing gets a bit unwieldy, but as a teacher assessing the learning of individuals I simply do just that and leave the rest to those who engage in such flights of fancy.


  3. Great article, with many an interesting point! It’s heartening to see that assessment tactics are changing and challenging the status quo, and frankly, it can’t happen fast enough. With the growth of technology and its integration into the classroom at all levels, there seem to be many shifts happening to the assessment paradigm as teachers and students alike have new tools to utilize and help them teach and learn. It’ll be interesting to see how it all evolves in the coming years. Thanks for your article and insight!


  4. Regarding No More Marking: it’s my understanding that this is used as the marking method for the PoP tests – we use these. However, I’ve had students who are clearly bright (full marks on SATS, completed a Language GCSE paper at grade 6) scoring lower than students who are only expected to achieve a grade 2 at the end of GCSE! It’s because of this that I struggle to treat the results as reliable and accurate.


    • Hi. Thanks for the comment. I guess this shows just how difficult it can be to capture a child’s attainment in single assessments. A teacher’s judgement on an overall performance or an exam mark may be more accurate than the comparative judgement outcome on a single piece of work. None of these things in themselves lead to improvement. It’s what happens next that matters.


  5. A good summary of the landscape, thank you Tom.

    A minor quibble: it’s not true to say “A C grade or a Level 4 never were or are indications of absolute standards”. For many years, grading meetings at exam boards consisted of looking at students’ work and identifying a band of marks within which the C grade boundary lay. This was done with reference to the Grade Descriptors and to scripts from the previous exam series at the grade boundary.
    Only when the ‘zone of uncertainty’ had been determined – usually around a 3-mark band – did we consider the statistics and whether the proportion of the cohort achieving a particular grade would be very different from last year. If it was close it would be accepted; if it was very different we would be asked to consider why that might be – was there a change in cohort, for example? (We had information about the types of schools in the cohort.) We also looked at how the same students had done on other papers in the suite and in other subjects – e.g. maths (as we were grading science).
    Later, sadly, it became less criterion-referenced and more driven by data – using KS3 SATs scores, and then KS2 SATs scores, to describe the cohort. This led to the situation we have now, where a new curriculum and specification might lead to better teaching and performance but this would not be reflected in better grades, because of norm referencing.

    Sorry, a minor quibble led to a lot of words to justify the quibble!


    • Thanks Mary. Yes – I’ve glossed over this a bit. I’m aware that in grade-setting meetings at exam boards there is reference to papers from current and past years, so that grade boundaries have some basis in the quality of answers. It’s not purely statistical. However, I am really saying that even this notion of standards has arisen from a bell-curved overview of student attainment. Within each paper in science there is no meaning in saying that any one question is a B question or a D question. Students with the same percentage score will have a full range of permutations of correct answers. There might be a general trend that emerges, where some questions are easy and others are hard, but that’s not directly linked to any one grade. That’s what I was getting at.


  6. A really good piece.

    The nonsense of single questions being used to indicate a NC level for a child was appalling. The NC levels were then subdivided into a, b and c, and further into half-levels, with single tests / a few questions / single statements being used to assess the level of a child – making matters worse.
    All secondary schools were measured on the ‘progress’ that children made relative to a KS2 level (and then often a teacher-assessed level – based on what?).
    All children could only go ‘forward’ in terms of their levels – an assumption adopted by many schools.

    Worse – NC levels were abandoned, to be replaced by schools’ own levels: criterion referencing different in every school in the land! Who understands it? The teachers? The students? The parents?

    The way forward?
    individual and immediate feedback (not always visible)
    peer and self-assessment (green pen?)
    norm-referenced test results for quality assurance


  7. […] I think this is a major area for review and reform in many schools.  Too many data systems have too much data; data that is never ever acted on; that is too remote from the information that actually drives learning gains and that is largely geared to create illusions of linear progress to satisfy accountability processes.  However, slowly but surely, the Emperor’s clothes are being seen for what they are – and we are beginning to see the assessment paradigm shift I talk about here.  […]


  8. Is criterion referencing so wrong? If I know these things, show these skills, solve this type of equation, should I not get the same credit as others before? I question the need for the bell curve at all. Why should education be competitive? To find the ‘best’? At what? Why? A real paradigm shift would be to explore how criterion referencing could be used to work for all. The bell curve strikes me as a way of ranking young people on the whole range of socio-economic factors that took them to that place at that time – and little else.


    • It’s not about being right or wrong – it’s about being meaningful. I’ve blogged before about standards and the bell curve – it’s just inescapable that you need to compare things to understand ‘difficulty’ or ‘excellence’. You can’t criterion-reference maths or an answer in English or science, because there are always harder questions, deeper levels of knowledge and so on. If lots of people can do something, it is easy; if only a few can do it, it is hard – and that helps us make sense of what we mean by standards. The other issue is the sheer scale of the number of criteria you would need to describe all the knowledge students have in any meaningful way – if we go for precision it would become unwieldy very quickly, or it could be very loose and generic, which is no good either. For me the issue is not to reject the bell curve – we can’t – it is to understand it and make sure that every place on it has value for what it is – we can’t all be above average.


  9. There is that need for winners again – excellence…
    My point is that the borders of excellence or standards are artificially created to manage supply and demand. We are not struggling for doctors in this country because there are not enough students with the ability to do the job – we are short because artificial limits are placed on the numbers who can be trained, and the current paradigm means there must be competition to pick the winners – only we are mainly picking those at best advantage in the race. Easy and hard? Ask someone struggling to pass the driving test if it is easy, yet the majority can drive. By your argument, should we not have the bell curve applied to the driving test? Limit the number of people on the road? Only the best drivers get to drive the best cars. We manage to do many difficult things!


  10. Thanks very much for this, Tom, it’s really useful.
    Do you think there is no place at all for summative assessment data collected on a whole-school level?

    I’m part of a team setting up a new school.
    We are thinking of one set of summative tests per term. My plan was to normalize all the data and just look for any long-term trends. Sensible or a waste of time?!
    I am also very much bearing in mind your comments about the need for a plan of action.

