Is reliability becoming the enemy of validity?

What would happen if I, as a history graduate, set out to write a mark scheme for a physics GCSE question? I dropped physics after year 9 but I think it is possible I could devise some instructions to markers that would ensure they all came up with the same marks for a given answer. In other words my mark scheme could deliver a RELIABLE outcome. However, what would my enormously experienced physics teacher husband think of my mark scheme? I think he’d either die of apoplexy or from laughing so hard:

“Heather, why on earth should they automatically get 2 marks just because they mentioned the sun? You’ve allowed full marks for students using the word gravity…”

After all, I haven’t a notion how to discriminate effectively between different levels of understanding of physics concepts. My mark scheme might be reliable but it would not deliver a valid judgement of the students’ understanding of physics.

A few weeks ago at ResearchEd the fantastically informed Amanda Spielman gave a talk on research Ofqual has done into the reliability of exam marking. Their research and that of Cambridge Assessment suggest marking is more RELIABLE than has been assumed by teachers. This might surprise teachers familiar with this every summer:

It is late August. A level results are out and the endless email discussions begin:

Hi Heather, Jake has emailed me. He doesn’t understand how he got an E on Politics A2 Unit 4 when he revised just as much as for Unit 3 (in which he got a B). I advised him to get a remark. Best, Alan

Dear Mrs F, My remark has come back unchanged and it means I’ve missed my Uni offer. I just don’t understand how I could have got an E. I worked so hard and had been getting As in my essays by the end. Would you look at my exam paper if I order it and see what you think? Thanks, Jake

Hi Alan, I’ve looked at Jake’s paper. I thought he must have fluffed an answer but all five answers have been given D/E level marks. I just don’t get it. He’s written what we taught him. Maybe the answers aren’t quite B standard – but E? Last year this unit was our best paper and this year it is a car crash. I’ll ask Clarissa if we can order her paper as she got full uniform marks. It might give some insight. Heather

Alan, I’ve looked at Clarissa’s paper. See what you think. It is a great answer. She learns like a machine and has reproduced the past mark scheme. Jake has made lots of valid points but not necessarily the ones in the mark scheme. Arguably they are key, but then again, you could just as easily argue other points are as important and how can such decent answers end up as E grade even if they don’t hit all mark scheme bullet points precisely? I just despair. How can we continue to deliver this course with conviction when we have no idea what will happen in the exam each year? Heather

I don’t like to blow my own trumpet but the surprisingly low marks on our A2 students’ politics papers were an aberration from what was a fantastic results day for our department this year:

Hi Heather, OMG those AS history results are amazing!!!! Patrick an A!!!! I worried Susie would get a C and she scored 93/100, where did that come from? Trish

I don’t tend to quibble when the results day lottery goes our way but I can admit that it is part of the same problem. Marking of subjects such as history and politics will always be less reliable than in maths and we must remember it is the overall A level score (not the swings between individual module results) that needs to be reliable. But… even so… there seems to be enormous volatility in our exam system. The following are seen in my department every year:

  1. Papers where the results have a very surprising (wrong) rank order. Weak students score high As while numerous students who have only ever written informed, insightful and intelligent prose have D grades.
  2. Students with massive swings in grades between papers (e.g. B on one and E on the other) despite both papers being taught by the same teacher and with the same general demands.
  3. Exam scripts where it is unclear to the teacher why a remark didn’t lead to a significant change in the result for a candidate.
  4. Quite noticeable differences in the degree of volatility over the years in results depending on paper, subject (history or politics in my case) and even exam board.

Cambridge Assessment have been looking into this volatility and have suggested that different markers ARE coming up with similar enough marks for the same scripts – marking is reliable enough. However, the report writers then assume that all other variation must be at school/student level. There is no doubt that a multitude of school and student level factors might explain volatility in results, such as different teachers covering a course, variations in teaching focus or simply that a student had a bad day. However, why was no thought given to whether lack of validity explains volatility in exam results?

For example, I have noticed a trend in our own results at GCSE and A level. The papers with quite flexible mark schemes, with more reliance on marker expertise, deliver more consistent outcomes closer to our own expectations of the students. It looks like attempts to make our politics A level papers more reliable have simply narrowed the range of possible responses that gain reward, limiting the ability of the assessment to discriminate effectively between student responses. Organisations such as HMC know there is a problem but perhaps overemphasise the impact of inexperienced markers.

The mounting pressure on exam boards from schools has driven them to make their marking ever more reliable, but this actually leads to increases in unexpected grade variation and produces greater injustices as the assessment becomes worse at discriminating between candidates. This process is exacerbated by the loss of face-to-face standardisation meetings (and, in subjects such as politics, by markers unused to teaching the material), and thus markers are ever more dependent on, and tied to, the mark scheme in front of them to guide their decision making. If students regularly have three grades’ difference between modules, perhaps the exam board should stop blathering on about the reliability of their systems and start thinking about the validity of their assessment.

The drive for reliability can too often be directly at the expense of validity.

It is a dangerously faulty assumption that if marking is reliable then valid inferences can be drawn from the results. We know that for some time the education establishment has been rather blasé about the validity of its assessments.

  • Apparently our country’s school children have been marching fairly uniformly up through National Curriculum levels, even though we know learning is not actually linear or uniform. It seems that whatever the levels were presumed to measure, they were not giving a valid snapshot of progress.
  • I’ve written about how history GCSE mark schemes assume a spurious progression in (non-existent) generic analytical skills.
  • Too often levels of response mark schemes are devised by individuals with little consideration of validity.
  • Dylan Wiliam points out that reliable assessment of problem solving often requires extensive rubrics which must define a ‘correct’ method of solving the problem.
  • EYFS assesses progress in characteristics such as resilience, when we don’t even know if it can be taught, and in critical thinking and creativity, when these are not constructs that can be generically assessed.

My experience at A level is just one indication of this bigger problem of inattention to validity of assessments in our education system.


16 thoughts on “Is reliability becoming the enemy of validity?”

  1. This is purely anecdotal but the headteacher of a boys private school was of the opinion that the more sophisticated a pupil’s answer the less likely they would be to receive the marks. He found this to be borne out during re-marking, with a significant increase in marks for his students – often the top students.

    1. I find remarks don’t tend to make a difference, except on the odd random occasion, because the answer *does* conform to the mark scheme and so officially has not been wrongly marked: a reliable result. There is enormous variation in approach between subjects and boards, though.

  2. I was wondering if you think the new Politics specification has achieved more valid results? I think the rank order at our school was roughly what I expected the first year and the move to more open essays helped this.
