Part 2: Early years assessment is not reliable or valid and thus not helpful

This is the second post on early years assessment. The first is here.

Imagine the government decided they wanted children to be taught to be more loving. Perhaps the powers that be could decide to make teaching how to love statutory and tell teachers they should measure each child’s growing capacity to love.

Typical scene in the EYFS classroom – a teacher recording observational assessment. 

There would be serious problems with trying to teach and assess this behaviour:

Definition: What is love? Does the word actually mean the same thing in different contexts? When I talk about ‘loving history’ am I describing the same thing (or ‘construct’) as when I ‘love my child’?

Transfer: Is ‘love’ something that universalises between contexts? For example, if you get better at loving your sibling, will that transfer to a love of friends, of school, or of learning geography?

Teaching: Do we know how to teach people to love in schools? Are we even certain it’s possible to teach it?

Progress: How does one get better at loving? Is progress linear? Might it just develop naturally?

Assessment: If ‘loving skills’ actually exist can they be effectively measured?

Loving – a universalising trait that can be taught?

The assumption that we can teach children to ‘love’ in one context and they’ll exercise ‘love’ in another might seem outlandish but, as I will explain, the writers of early years assessment fell into just such an error in the Early Years Foundation Stage framework and assessment profile.

In my last post I explained how the priority placed on assessment in authentic environments has come at the cost of reliability, meaning valid conclusions cannot be drawn from Early Years Foundation Stage Profile assessment data. There are, however, other problems with assessment in the early years…

Problems of ‘validity’ and ‘construct validity’

Construct validity: the degree to which a test measures what it claims, or purports, to be measuring.

Validity: the degree to which inferences can be drawn from an assessment about what students can do in other situations, at other times and in other contexts.

If we think we are measuring ‘love’ but it doesn’t really exist as a single skill that can be developed then our assessment is not valid. The inferences we draw from that assessment about student behaviour would also be invalid.

Let’s relate this to the EYFS assessment profile.

Problems with the EYFS Profile ‘characteristics of effective learning’

The EYFS Profile Guide requires practitioners to comment on a child’s skills and abilities in relation to three ‘constructs’ labelled as ‘characteristics of effective learning’: playing and learning, active learning (motivation), and creating and thinking critically.

We can take one of these characteristics of effective learning to illustrate a serious problem with the validity of the assessment. While a child might well demonstrate creativity and critical thinking (the third characteristic listed), it is now well established that such behaviours are NOT skills or abilities that can be learnt in one context and transferred to another entirely different context – they don’t universalise any more than ‘loving’. In fact the capacity to be creative or think critically is dependent on specific knowledge of the issue in question. Many children can think very critically about football but that apparent behaviour evaporates when faced with some maths. You’ll think critically in maths because you know a lot about solving similar maths problems, and this capacity won’t make you think any more critically when solving something different like a word puzzle or a detective mystery.

Creating and thinking critically are NOT skills or abilities that can be learnt in one context and then applied to another

Creating and thinking critically are not ‘constructs’ which can be taught and assessed in isolation. Therefore no valid general inference about these behaviours – the kind of inference a reported ‘characteristic of learning’ implies – can be drawn from what is observed and reported. If you wish a child to display critical thinking you should teach them lots of relevant knowledge about the specific material you would like them to think critically about.

In fact, what is known about traits such as critical thinking suggests that they are ‘biologically primary’ and don’t even need to be learned [see an accessible explanation here].

Moving on to another characteristic of effective learning: active learning or motivation. This presupposes both that ‘motivation’ is a universalising trait and that we are confident we know how to inculcate it. In fact, as with critical thinking, it is perfectly possible to be involved and willing to concentrate in some activities (computer games) but not others (writing).

There has been high profile research on motivation, particularly Dweck’s work on growth mindset and Angela Duckworth’s on grit. Duckworth has created a test that she argues demonstrates that adult subjects possess a universalising trait she calls ‘Grit’. But even this world expert concedes that we do not know how to teach Grit, and she rejects the use of her Grit scale for high stakes tests. Regarding growth mindset, serious doubts have been raised about failures to replicate Dweck’s research findings and about statistically insignificant results that have been used to support it.

Despite serious questions around the teaching of motivation, the EYFS Profile ‘characteristics of learning’ presume it is a trait that can be inculcated in pre-schoolers and, without solid research evidence, that it can be reliably assessed.

The final characteristic of effective learning is playing and learning. Of course children learn when playing. This does not mean the behaviours to be assessed under this heading (‘finding out and exploring’, ‘using what they know in play’ or ‘being willing to have a go’) are any more universalising as traits, or any less dependent on context, than the other characteristics discussed. It cannot just be presumed that they are.

Problems with the ‘Early Learning Goals’

At the end of reception each child’s level of development is assessed against the 17 EYFS Profile ‘Early Learning Goals’. In my previous post I discussed the problems with the reliability of this assessment. We also see the problem of construct validity in many of the assumptions within the Early Learning Goals. Some goals are clearly not constructs in their own right, and others may well not be; serious questions need to be asked about whether they are universalising traits or actually context-dependent behaviours.

For example, ELG 2 is ‘understanding’. Understanding is not a generic skill; it is dependent on domain-specific knowledge. True, a child does need to know the meaning of the words ‘how’ and ‘why’, which are highlighted in the assessment. But while understanding is a goal of education it can’t be assessed generically: you have to understand something, and understanding one thing does not mean you will understand something else. The same is true for ‘being imaginative’ (ELG 17).

An example of evidence of ELG 2, understanding, in the EYFS profile exemplification materials.

Are ELG 1 ‘listening and attention’ or ELG 16 ‘exploring and using media and materials’ actually universalising constructs? I rarely see qualitative and observational early years research that even questions whether these early learning goals are universalising traits, let alone looks seriously at whether they can be assessed. This is despite decades of research in cognitive psychology leading to a settled consensus which challenges many of the unquestioned constructs that underpin EYFS assessment.

It is well known that traits such as understanding, creativity, critical thinking don’t universalise. Why, in early years education, are these bogus forms of assessment not only used uncritically but allowed to dominate the precious time when vulnerable children could be benefiting from valuable teacher attention?

n.b. I have deliberately limited my discussion to a critique using general principles of assessment rather than arguments that would need to be based on experience or practice.


Early years assessment is not reliable or valid and thus not helpful

The academic year my daughter was three she attended two different nursery settings. She took away two quite different EYFS assessments, one from each setting, at the end of the year. The disagreement between these was not a one off mistake or due to incompetence but inevitable because EYFS assessment does not meet the basic requirements of effective assessment – that it should be reliable and valid*.

We have very well researched principles to guide educational assessment and these principles can and should be applied to the ‘Early Years Foundation Stage Profile’. This is the statutory assessment used nationally to assess the learning of children up to the age of 5. The purpose of the EYFS assessment profile is summative:

‘To provide an accurate national data set relating to levels of child development at the end of EYFS’

It is also used to ‘accurately inform parents about their child’s development’. The EYFS profile is not fit for these purposes and its weaknesses are exposed when it is judged using standard principles of assessment design.

EYFS profiles are created by teachers when children are 5 to report on their progress against 17 early learning goals and describe the ‘characteristics of their learning’. The assessment is through teacher observation. The profile guidance stresses that,

‘…to accurately assess these characteristics, practitioners need to observe learning which children have initiated rather than focusing on what children do when prompted.’

Illustration is taken from EYFS assessment exemplification materials for reading

Thus the EYFS Profile exemplification materials for literacy and maths only give examples of assessment through teacher observations when children are engaged in activities they have chosen to play (child initiated activities). This is a very different approach from the subsequent assessment of children throughout their later schooling, which is based on tests created by adults. The EYFS profile writers no doubt wanted to avoid what Wiliam and Black (1996) call the ‘distortions and undesirable consequences’ created by formal testing.

Reaching valid conclusions in formal testing requires:

  1.    Standard conditions – so there is reassurance that all children receive the same level of help
  2.    A range of difficulty in items used for testing – carefully chosen test items will discriminate between the proficiency of different children
  3.    Careful selection of content from the domain to be covered – to ensure the items are representative enough to allow an inference about the domain (Koretz, pp. 23-28)

The EYFS profile is specifically designed to avoid the distortions created by such restrictions, which produce an artificial test environment very different from the real-life situations in which learning will ultimately need to be used. However, as I explain below, in so doing the profile loses so much reliability that teacher observations cannot support valid inferences.

This is because when assessing summatively the priority is to create a shared meaning about how pupils will perform beyond school and in comparison with their peers nationally (Koretz 2008). As Wiliam and Black (1996) explain, ‘the considerable distortions and undesirable consequences [of formal testing] are often justified by the need to create consistency of interpretation.’ This is why GCSE exams are not currently sat in authentic contexts with teachers with clipboards (as in EYFS) observing children in attempted simulations of real life contexts. Using teacher observation can be very useful for an individual teacher when assessing formatively (deciding what a child needs to learn next) but the challenges of obtaining a reliable shared meaning nationally that stop observational forms of assessment being used for GCSEs do not just disappear because the children involved are very young.

Problems of reliability

Reliability: Little inconsistency between one measurement and the next (Koretz, 2008)
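To make ‘inconsistency between one measurement and the next’ concrete, here is a minimal simulation sketch. It is entirely my own toy model (nothing from Koretz or the profile guidance): each child has a ‘true’ level of development, and each setting’s observational judgement adds noise from unstandardised conditions, the tasks a child happens to choose and the observer’s interpretation.

```python
import random

random.seed(0)

# Toy model: 1,000 children, each with a 'true' developmental level.
children = [random.gauss(0, 1) for _ in range(1000)]

def observe(true_level, noise_sd):
    """One setting's observational judgement: true level plus unstandardised noise."""
    return true_level + random.gauss(0, noise_sd)

def correlation(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    var_x = sum((x - mx) ** 2 for x in xs) / n
    var_y = sum((y - my) ** 2 for y in ys) / n
    return cov / (var_x * var_y) ** 0.5

# Low noise stands in for a standardised task; high noise for ad hoc observation
# of whatever a child happens to be doing on the day.
for noise_sd in (0.3, 1.0, 2.0):
    setting_a = [observe(c, noise_sd) for c in children]
    setting_b = [observe(c, noise_sd) for c in children]
    print(f"noise {noise_sd}: agreement between two settings r = {correlation(setting_a, setting_b):.2f}")
```

Once the observational noise is comparable in size to the real differences between children, two settings assessing the same children largely stop agreeing – which is exactly what happened with my daughter’s two profiles.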

Assessing child initiated activities and the problem of reliability

The variation in my daughter’s two assessments was unsurprising given that…

  • Valid summative conclusions require ‘standardised conditions of assessment’ between settings and this is not possible when observing child initiated play.
  • Nor is it possible to even create comparative tasks ranging in difficulty that all the children in one setting will attempt.
  • The teacher cannot be sure their observations effectively identify progress in each separate area as they have to make do with whatever children choose to do.
  • These limitations make it hard to standardise between children even within one setting, so it is unsurprising that the two nurseries had built different profiles of my daughter.

The EYFS Profile Guide does instruct that practitioners ‘make sure the child has the opportunity to demonstrate what they know, understand and can do’ and does not preclude all adult initiated activities from assessment. However, the exemplification materials only reference child initiated activity and, of course, the guide instructs practitioners that

‘…to accurately assess these characteristics, practitioners need to observe learning which children have initiated rather than focusing on what children do when prompted.’

Illustration from EYFS assessment exemplification materials for writing. Note these do not have examples of assessment from written tasks a teacher has asked children to undertake – ONLY writing voluntarily undertaken by the child during play.

Assessing adult initiated activities and the problem of reliability

Even when some children are engaged in an activity initiated or prompted by an adult:

  • The setting cannot ensure the conditions of the activity have been standardised; for example, it isn’t possible to predict how a child will choose to approach a number game set up for them to play.
  • It’s not practically possible to ensure the same task has been given to all children in the same conditions to discriminate meaningfully between them.

Assessment using ‘a range of perspectives’ and the problem of reliability

The EYFS profile handbook suggests that:

‘Accurate assessment will depend on contributions from a range of perspectives…Practitioners should involve children fully in their own assessment by encouraging them to communicate about and review their own learning…. Assessments which don’t include the parents’ contribution give an incomplete picture of the child’s learning and development.’

A parent’s contribution taken from EYFS assessment exemplification materials for number

Given the difficulty one teacher will have observing all aspects of 30 children’s development, it is unsurprising that the profile guide stresses the importance of contributions from others to increase the validity of inferences. However, it is incorrect to claim the input of the child or of parents will make the assessment more accurate for summative purposes. With this feedback the conditions, the difficulty and the specifics of the content will not have been considered, creating unavoidable inconsistency.

Using child-led activities to assess literacy and numeracy and the problem of reliability

The reading assessment for one of my daughters seemed oddly low. The reception teacher explained that while she knew my daughter could read at a higher level, the local authority guidance on the EYFS profile said her judgement must be based on ‘naturalistic’ behaviour. She had to observe my daughter (one of 30) voluntarily going to the book corner, choosing to read out loud to herself at the requisite level and volunteering sensible comments on her reading.

 

Illustration is taken from EYFS assessment exemplification materials for reading. Note these do not have examples of assessment from reading a teacher has asked children to undertake – ONLY reading voluntarily undertaken by the child during play.

The determination to prioritise assessment of naturalistic behaviour is understandable when assessing how well a child can interact with their peers. However, the reliability sacrificed in the process can’t be justified when assessing literacy or maths. The success of explicit testing in these areas suggests they do not need the same naturalistic criteria to ensure a valid inference can be made from the assessment.

Are teachers meant to interpret the profile guidance in this way? The guidance is unclear, but while the exemplification materials include only examples of naturalistic observational assessment we are unlikely to acquire accurate assessments of reading, writing and mathematical ability from EYFS profiles.

Five year olds should not sit test papers in formal exam conditions but this does not mean only observation in naturalistic settings (whether adult or child initiated) is reasonable or the most reliable option.  The inherent unreliability of observational assessment means results can’t support the inferences required for such summative assessment to be a meaningful exercise. It cannot, as intended ‘provide an accurate national data set relating to levels of child development at the end of EYFS’ or ‘accurately inform parents about their child’s development’.

In my next post I explore the problems with the validity of our national early years assessment.

 

*n.b. I have deliberately limited my discussion to a critique using assessment theory rather than arguments that would need to be based on experience or practice.

References

Koretz, D. (2008). Measuring Up. Cambridge, Massachusetts: Harvard University Press.

Standards and Testing Agency. (2016). Early Years Foundation Stage Profile. Retrieved from https://www.gov.uk/government/publications/early-years-foundation-stage-profile-handbook

Standards and Testing Agency. (2016). Early Years Foundation Stage Profile: exemplification materials. Retrieved from https://www.gov.uk/government/publications/eyfs-profile-exemplication-materials

Wiliam, D., & Black, P. (1996). Meanings and consequences: a basis for distinguishing formative and summative functions of assessment? British Educational Research Journal, 537-548.

Data Tracking and the LFs*

Until recently I was unfamiliar with the sorts of pupil tracking systems used in most schools. I’ve also recently had to get to grips with the plethora of acronyms commonly used to categorise groups of students being tracked. I’ve come across PP, LPAs, HPAs and LACs but, rather surprisingly, no mention of the LF. To be honest I am surprised by this gap given that in my considerable experience it is how the teacher and school manage the performance of the LFs that is most crucial to healthy end of year data. If the LFs perform near their potential you’re basically laughing all the way to the exam hall.

I should, at this stage, be clear. LF is not a standard acronym (it was invented by my husband) but it does describe a clearly recognisable and significant sub-section of any secondary school population. The L stands for lazy (and the second word begins with an F).

I am being very flippant, I know, but my point is serious enough.

Today I happened to need to look at a spreadsheet containing data for an old cohort from my last school. As my eye glanced down the baseline testing stats, used for tracking, I couldn’t help emitting frequent snorts of derision. The trigger of my scorn was the original baseline test data for some of my most ‘affectionately’ remembered GCSE students (truthfully, actually, I do remember them all with warmth). I commented to my husband that they needed to be real… erm… ‘LFs’ to score that low on the baseline given the brains with which I knew perfectly well that they were blessed.

If I and my colleagues had based our ambitions for those particular individuals on their predicted grades from the baseline they’d have cruised lazily through school. Their meagre efforts would have been continually affirmed as adequate, which would have been ruinous for their habits and character and a betrayal of their potential.

If value added is what drives you, there is also an obvious truth: if you effectively cap your ambitions for pupils by only showing concern when they fall short of the grades predicted from their baseline, you will still have to absorb the scores of some pupils who just aren’t going to be able to live up to their predictions. Meanwhile you lose some of the scores of those who should do better than their baseline result suggests – the very scores that would otherwise balance everything out.
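To put rough numbers on that asymmetry, here is a toy sketch built entirely on my own made-up assumptions (not real tracking data): pupils coast roughly a grade below their potential unless pushed, and the baseline prediction misses each pupil’s real potential by a random amount in either direction. If concern is triggered only when a pupil slips below their prediction, the shortfalls still get absorbed while the above-prediction gains never materialise.

```python
import random

random.seed(1)

N = 1000
# Hypothetical cohort: the baseline prediction misses each pupil's real potential
# by a random amount in either direction.
cohort = []
for _ in range(N):
    predicted = random.gauss(5.0, 1.0)              # grade predicted from the baseline test
    potential = predicted + random.gauss(0.0, 1.0)  # what the pupil could achieve if pushed
    cohort.append((predicted, potential))

def final_grade(potential, pushed):
    # Assumption: pushed pupils reach their potential; unpushed pupils coast a grade below it.
    return potential if pushed else potential - 1.0

# Policy A: only show concern when a pupil is tracking below their predicted grade.
policy_cap = [final_grade(pot, pushed=(pot - 1.0 < pred)) for pred, pot in cohort]
# Policy B: push every pupil regardless of what the baseline predicted.
policy_all = [final_grade(pot, pushed=True) for pred, pot in cohort]

mean_predicted = sum(pred for pred, _ in cohort) / N
print("Value added, intervene below prediction only:", round(sum(policy_cap) / N - mean_predicted, 2))
print("Value added, push everyone:                  ", round(sum(policy_all) / N - mean_predicted, 2))
```

On these invented numbers the ‘only worry below prediction’ policy loses roughly a sixth of a grade per pupil on average compared with pushing everyone, and the entire loss comes from the pupils the baseline underestimated – the LFs included.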

I think what bothers me most is the ‘inhumanity’ of a purely data driven approach to progress. How could school teachers, of all people, have devised a system that allows no room to acknowledge the obvious human truth before our eyes? When and where haven’t some humans, sometimes, been rather lazy? Down through the centuries school teachers have exercised their craft, ensuring pupils learn important things despite the entirely natural human propensity towards sloth, magnified in the teenage years. What made us think we could dispense with that wisdom, that our spreadsheets knew better?

Can we re-learn to teach the pupils that actually sit before us, responding to them using our hard-won expertise? Oh, I do hope so.

*Warning: this post nearly contains bad language.

The ‘quite tidy garden’ …or why level descriptors aren’t very helpful.

Dear Josh,

Thank you for agreeing to sort out our garden over your long holiday. As we’ll be away all summer here is a guide that tells you all you need to know to get

from this…

…to this

STEP A: You should begin by assessing the garden to decide its level. Read through these level descriptors to decide:

Level 1: Your garden is very overgrown. Any lawn has not been mown for some years. Shrubs have not been pruned for a considerable period. There are no visible beds and typically there will be large areas taken over by brambles and or nettles. There will probably be an abandoned armchair (or similar worn out furniture) somewhere in the overgrowth as well as assorted rubble and the old concrete base from a fallen shed. Boundary fencing will have collapsed.

Level 2: Your garden is just a little overgrown. The lawn is patchy through neglect and has only been mown sporadically. Shrubs generally have not been pruned recently. Beds look neglected and are not well stocked. There may be various forms of old rubbish abandoned in the far corners of the garden along with old lawn clippings and hedge trimmings. Boundary fences are in disrepair.

Level 3: Your garden is well tended. Lawns are mown regularly and contain no moss and weeds and shrubs are regularly pruned. Flower beds are well demarcated and contain no weeds. They are well stocked with appropriate bedding plants. The garden is quite tidy and boundary fencing is new and strong.

STEP B:

Josh, if you decide the garden is Level 1 (that is certainly our view) then I suggest you look at the Level 2 descriptor to guide you as to your next steps. It is clear that you need to move the garden from ‘very overgrown’ to ‘just a little overgrown’. For example, in a Level 1 garden, shrubs ‘have not been pruned for a considerable period’. You need to move on from that to a Level 2 garden where ‘shrubs have not been pruned recently’. The lawn needs to move from having ‘not been mown for some years’ to Level 2 ‘has only been mown sporadically’. Aim to move the boundary fencing on from Level 1 ‘will have collapsed’ to Level 2 ‘in disrepair’.  To move on from Level 1 for rubbish, for example, you’ll need to move that old armchair to a far corner of the garden.

STEP C:

Now move the garden from Level 2 to Level 3. This means you should ensure the garden is ‘well tended’ rather than ‘a little overgrown’. What useful advice!

Using level descriptors makes it so clear for you, doesn’t it? Hubby is trying to insist that I also leave you his instructions but they are hopeless as he doesn’t understand that you need to know your next steps to make progress in gardening. He’s written reams and reams of advice including instructions like:

‘You’ll find the strimmer in the garage’

‘Start by clearing all the nettles’

‘Ken will come and help you shift the concrete’

‘The tip is open from 10-4 at weekends’

‘Marion next door can advise you about the best bedding plants to buy’

His instructions are just too specific to our garden. To learn the gardening skills that will achieve a Level 3 garden what you need is to really know your next level targets. I won’t confuse you by leaving you his nonsense!

We’ll see you in September and in the meantime we wish you happy gardening!

 

With apologies to any actual gardeners out there who know what they are talking about and enormous thanks to Daisy Christodoulou whose recent book helped me appreciate just why we shouldn’t use level descriptors as feedback. 

Knowledge organisers: fit for purpose?

Definition of a knowledge organiser: Summary of what a student needs to know that must be fitted onto an A4 sheet of paper.

Desk bins: Stuff I Don’t Need to Know…

If you google the term ‘knowledge organisers’ you’ll find a mass of examples. They are on sale on the TES resource site – some sheets of A4 print costing up to £7.50. It seems knowledge organisers have taken off. Teachers up and down the country are beavering away to summarise what needs to be known in their subject area.

It is good news that teachers are starting to think more about curriculum. More discussion of what is being taught, how it should be sequenced and how it can be remembered is long overdue. However, I think there is a significant weakness with some of these documents. I looked at lots of knowledge organisers to prepare for training our curriculum leaders and probably the single biggest weakness I saw was confusion over purpose.

 

I think there are three very valid purposes for knowledge organisers:

  A. Curriculum mapping – for the TEACHER

Identifying powerful knowledge, planning to build schemas, identifying transferable knowledge and mapping progression in knowledge.

  B. For reference – for the PUPIL

In place of a textbook or a form of summary notes for pupils to reference.

  C. A list of revision items – for the PUPIL (and possibly the parents)

What the teacher has decided ALL pupils need to know as a minimum at the end of the topic.

 

All three purposes can be valid but when I look at the mass of organisers online I suspect there has often been a lack of clarity about the purpose the knowledge organiser is to serve.

Classic confusions of purpose:

  1. Confusing a curriculum mapping document with a reference document:

A teacher sits down and teases out what knowledge seems crucial for a topic. As they engage in this crucial thinking they create a dense document full of references that summarises their ideas. So far so good… but a document that summarises a teacher’s thinking is unlikely to be in the best format for a child to use. The child, given this document, sees what looks like a mass of information in tiny text, crammed onto one sheet of A4. They have no real notion of which bits to learn, how to prioritise all that detail or how to apply it. All this is self-evident to the teacher but not to the child.

  2. Confusing a knowledge organiser with a textbook:

Teachers who have written textbooks tell me that there is a painstaking editorial process to ensure quality. Despite this there is a cottage industry of teachers writing series of knowledge organisers which amount to their own textbooks. Sometimes this is unavoidable. Some textbooks are poor and some topics aren’t covered in the textbooks available. Perhaps sometimes the desperate and continual begging of teachers that their school should prioritise the purchase of textbooks falls on deaf ears and teachers have no choice but to spend every evening creating their own textbooks photocopied on A4 paper…

…but perhaps we all sometimes need to remind ourselves that there is no virtue in reinventing the wheel.

  3. Confusing a textbook with summary notes:

The information included on an A4 sheet of paper necessarily lacks the explanatory context contained in a textbook or detailed notes. If such summaries are used in place of a textbook or detailed notes the student will lack the explanation they need to make sense of the detail.

  4. Confusing a reference document or notes with a list of revision items for a test:

If we want all pupils to acquire mastery of some basics we can list these basic facts we have identified as threshold knowledge in a knowledge organiser. We can then check that the whole class know these facts using a test. The test requires the act of recall which also strengthens the memory of these details in our pupils’ minds.

Often, however, pupils are given reference documents to learn. In this situation the details will be too extensive to be learnt for one test. It is not possible to expect the whole class to know everything listed, so the teacher cannot ensure that all pupils have mastered the identified ‘threshold’ facts. Weaker students will be very poor at recognising which details are most important to focus on learning, at realising what is likely to come up in a test and at anticipating the format in which it will be asked. Many will also find a longer reference document contains an overwhelming amount of detail and give up. The chance to build self-efficacy and thus self-esteem has been lost.

 

If you are developing knowledge organisers to facilitate factual testing then your focus is on Purpose C – creating a list of revision items. Below is a list of criteria I think are worth considering:

  1. Purpose (to facilitate mastery testing of a list of revision items)
  • Exclude knowledge present for the benefit of the teacher
  • Exclude explanatory detail which should be in notes or a textbook.
  2. Amount
  • A short topic’s worth (e.g. two weeks’ teaching at GCSE)
  • An amount that all in the class can learn
  • Be careful of expectations that are too low and, if necessary, ramp up demand once the habit is in place.
  3. Threshold or most ‘powerful’ knowledge
  • Which knowledge is necessary for the topic?
  • Which knowledge is ‘collectively sufficient’ for the topic?
  • Which knowledge will allow future learning of subsequent topics?
  • Which knowledge will best prompt retrieval of chunks of explanatory detail?
  • CUT any extraneous detail (even if it looks pretty)
  • Include relevant definitions, brief lists of factors/reasons/arguments, quotes, diagrams, summaries etc.
  • Check accuracy (especially when adapting internet finds)
  4. Necessary prior knowledge
  • Does knowledge included in the organiser presume a grasp of other material unlikely yet to be mastered?
  5. Concise wording
  • Is knowledge phrased in the way you wish it to be learned?

Happy knowledge organising!

 

One approach to regular, low stakes and short factual tests.

I find the way in which the Quizlet app has taken off fascinating. Millions (or billions?) have been pumped into ed tech but Quizlet did not take off because education technology companies marketed it to schools. Pupils and teachers had to ‘discover’ Quizlet. They appreciated its usefulness for that most basic purpose of education – learning. The growth of Quizlet was ‘bottom up’, while schools continue to have technological solutions looking for problems thrust upon them from above. What an indictment of the ed tech industry.

There has been a recent growth of interest in methods of ensuring students learn long term the content they have been taught. This is in part due to the influence of research in cognitive psychology but also due to some influential education bloggers such as Joe Kirby and the changing educational climate caused by a shift away from modular examinations. Wouldn’t it be wonderful if innovation in technology focused on finding simple solutions to actual problems (like Quizlet) instead of chasing Sugata Mitra’s unicorn of revolutionising learning?

In the meantime we must look for useful ways to ensure students learn key information without the help of the ed tech industry. I was very impressed by the ideas Steve Mastin shared at the Historical Association conference yesterday but I realised I had never blogged about my own approach and its pros and cons compared with others I have come across.

I developed a system of regular testing for our history and politics department about four years ago. I didn’t know about the research from cognitive psychology back then and instead used what I had learnt from using Direct Instruction programmes with my primary aged children.

Key features of this approach to regular factual testing at GCSE and A level:

  • Approximately once a fortnight a class is given a learning homework, probably at the end of a topic or sub topic.
  • All children are given a guidance sheet that lists exactly what areas will come up in the test and need to be learnt. Often textbook page references are provided so key material can be easily located.

An example test and guidance sheet.

  • The items chosen for the test reflect the test writer’s judgement of what constitute the very key facts that could provide a minimum framework of knowledge for that topic (n.b. the students are familiar with the format and know how much material will be sufficient for an ‘explain’ question.) The way knowledge has been presented in notes or textbook can make it easier or more difficult for the students to find relevant material to learn. In the example above the textbook very conveniently summarises all they need to know.
  • The test normally takes about 10-15 minutes of a lesson. It is always out of 20 and the pass mark is high, always 14/20. Any students who fail the test have to resit it in their own time. We give rewards for full marks. (A small sketch of this pass/retake rule follows this list.) The test writer must try to ensure that the test represents a reasonable amount to ask all students to learn for homework or the system won’t work.
  • There is no time limit for the test. I just take them in when all are finished.
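For anyone who likes to see the routine written down, here is a tiny sketch of that pass/retake rule. The names and marks are invented; the out-of-20 total, the 14/20 pass mark, the retakes and the full-mark rewards are as described above.

```python
TOTAL_MARKS = 20
PASS_MARK = 14    # anyone below this resits the test in their own time
REWARD_MARK = 20  # full marks earns a reward

# Hypothetical scores for one fortnightly test.
scores = {"Amy": 20, "Ben": 13, "Cara": 17, "Dev": 9, "Ella": 14}

retakes = sorted(name for name, mark in scores.items() if mark < PASS_MARK)
rewards = sorted(name for name, mark in scores.items() if mark == REWARD_MARK)

print(f"Retakes ({len(retakes)}):", ", ".join(retakes))
print(f"Rewards ({len(rewards)}):", ", ".join(rewards))
```

The point of keeping the rule this simple is that everyone in the class knows exactly what happens at 13/20 and exactly what happens at 20/20, every single fortnight.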

I haven’t developed ‘knowledge organisers’, even though I can see their advantages, because I don’t want to limit test items to the amount of material that can be fitted onto one sheet of paper. Additionally, I’ve always felt a bit nervous about sending the message that there is something comprehensive about the material selected for testing. I’ve found my approach has some advantages and disadvantages.

Advantages of this approach to testing:

  • It is regular enough that tests never have to cover too much material and become daunting.
  • I can set a test that I can reasonably expect all students in the class to pass if they do their homework.
  • The regularity allows a familiar routine to develop. The students adjust to the routine quickly and they quite like it.
  • The guidance sheet works better than simply telling students which facts to learn. This is because they must go back to their notes or textbook and find the information which provides a form of review and requires some active thought about the topic.
  • The guidance sheet works when it is clear enough to ensure all students can find the information but some thought is still necessary to locate the key points.
  • Test questions often ask students to use information in the way they will need to use it in extended writing. For example I won’t just ask questions like “When did Hitler come to power?” I will also ask questions like “Give two reasons why Hitler ordered the Night of the Long Knives”.
  • Always making the test out of 20 allows students to try and beat their last total. The predictability of the pass mark also leads to acceptance of it.
  • Initially we get lots of retakers but the numbers very quickly dwindle as the students realise the inevitability of the consequence of the failure to do their homework.
  • The insistence on retaking any failed tests means all students really do end up having to learn a framework of key knowledge.
  • I’ve found that ensuring all students learn a minimum framework of knowledge before moving on has made it easier to teach each subsequent topic. There is a lovely sense of steadily accumulating knowledge and understanding. I also seem to be getting through the course material faster despite the time taken for testing.

Disadvantages of my approach to testing:

  • It can only work in a school with a culture of setting regular homework that is generally completed.
  • Teachers have to mark the tests because the responses are not simple factual answers. I think this is a price worth paying for a wider range of useful test items but I can see that this becomes more challenging depending on workload.
  • There is no neat and simple knowledge organiser listing key facts.
  • We’re fallible. Sometimes guidance isn’t as clear as intended and you need to ensure test materials really are refined for next year and problems that arise are not just forgotten.
  • If you’re not strict about your marking your class will gradually learn less and less for each point on the guidance sheet.
  • This system does not have a built in mechanism for reviewing old test material in a systematic way.

We have not really found that lower ability students (within an ability range of A*-D) struggle. I know that other schools using similar testing with wider ability ranges have not encountered significant problems either. Sometimes students tell us that they find it hard to learn the material. A few do struggle to develop the self-discipline necessary to settle down to some learning, but we haven’t had a student who is incapable when they devote a reasonable amount of time. Given that those complaining are usually just making an excuse for failing to do their homework, I generally respond that if they can’t learn the material for one tiny test how on earth are they proposing to learn a whole GCSE? I check that anyone who fails a test is revising efficiently, but after a few retakes it transpires that they don’t, after all, have significant difficulties learning the material. Many students who are weak on paper like the tests.

We also set regular tests of chronology. At least once a week my class will put events printed onto cards into chronological order and every now and then I give them a test like the one below after a homework or two to learn the events. I don’t have to mark these myself – which is rather an advantage!

An example chronology test.

 

I very much liked Steve Mastin’s approach of giving periodic multiple choice tests which review old material. Good multiple choice questions can be really useful but are very hard to write. Which brings me back to my first point. Come on, education technology industry! How about dropping the development of impractical, rather time-consuming and gimmicky apps? We need those with funding and expertise to work in conjunction with curriculum subject experts to develop genuinely useful and subject-specific forms of assessment. It must be possible to develop products that can really help us assess and track success in learning the key information children need to know in each subject.

Testing – a double edged sword.

As teachers we’ve all had those moments when, eyes shining, tongue loosed by the excitement of the moment, we share a fascinating nugget of detail with our class. We’ve all also experienced the dull deflation of that enthusiasm when our students respond “But is this in the exam? Do we actually need to know this?” It seems our focus on testing has created a generation of students who view their studies purely as a means to an end and have lost the ability to enjoy learning for its own sake. Such responses from my classes normally trigger an agony of soul searching on my part. I question whether my desire to get my students good results means these sorts of responses are my own fault, a just retribution for my desire to show off my teaching prowess through a healthy end of year results spreadsheet. The same problem is seen with primary children asking what they need to do to get to the next level rather than enquiring further into a subject. We see it again in GCSE English courses in which the easiest books are chosen, and read only in extract form, to optimise exam results. I also despise (yes, it is that strong) the nonsensical hoop-jumping drill that consumes hours of teaching time and exists only to ensure student responses conform to exam rubrics so they can get the marks they deserve.

These drawbacks of testing are explained in a blog post by Daisy Christodoulou who took part in a debate recently with Toby Young, Tristram Hunt and Tony Little on the subject of testing. Daisy explains that the proposition of the debate was that tests were ‘essentially a necessary evil…in many ways inimical to good education…Tony Little said that our focus should not be on exams, but on ensuring a love of learning.’ In her post Daisy argues coherently that testing is nonetheless very useful for the reliable feedback it provides and the way the ‘testing effect’ aids memory. I agree but would go further than arguing for teacher set tests. I question the assumption that external exams such as GCSEs and A levels are just a necessary evil, inimical to good education. I’ll explain further.

A week ago my school had their year 13 parents’ evening. The talk was all of university applications and predicted grades. Students had been investigating universities and realisation had dawned that they were not going to get to the prestigious institutions their ambitions desired without those crucial A grades. Every year students who had never quite been able to take their studies seriously wise up to reality; you can see a new purpose in their demeanour as they ‘set aside childish things’ and get down to some serious study. External exams are essential for good education because without them too many students would never summon up that motivation to learn, or to learn enough, in enough detail, and would never reach a standard they would otherwise be capable of. Witness what happens when teachers are told their subject will still be taught but no longer examined at GCSE or A level. You may have noticed the campaigns to stop A levels being scrapped in languages such as Polish. Teachers know perfectly well that what is examined generally IS what is taken seriously. Where exams aren’t used, other forms of competition tend to arise to serve the same purpose.

The assumption that motivation in education should be intrinsic goes pretty unquestioned but while most teachers would profess to believe this, their behaviour would suggest otherwise. Why is it that every year children are under so much stress from SATs? The children have no reason to take these seriously. It is the teachers that explain the importance of these tests to the pupils – to ensure they take the tests seriously, that they pay attention and work hard. Researchers expect a significant diminution in performance on tests when the stakes are low and have to factor this into their analysis.

Eric Kalenze said in his talk at ResearchEd that extrinsic motivation is seriously underrated in education and I agree with him. On the one hand we must avoid bribing children when they would or could work happily with no reward; this is clearly counterproductive. We also want to skilfully withdraw extrinsic rewards as we see children becoming capable of appreciating the content for its own sake. We want to stimulate our students’ curiosity and help them to appreciate what they are learning. However, human motivation is complex. Just how many children would ever learn to their full potential with only intrinsic motivators? I’ve certainly heard of some, but even then enthusiasms tend to be selective. I can’t help thinking that if avoidance of extrinsic motivators were an educational panacea, Steiner schools would have taken off in a way they never have.

Just how many students would be sitting in our secondary schools or our A level classes if it were not compulsory and they didn’t need proof of their learning for success later in life? Could it be that external exams, rather than being harmful to deeper learning, are actually the very REASON why children end up learning lots? If, at 16, it had made no difference to my future whether I understood GCSE maths I might just have spent more time following my enthusiasm for 19th-century novels and neglected mathematics entirely. I have also known countless students fall in love with a subject as they study, but the initial impetus for that study was the desire for exam success. To really excel in a subject takes serious hard work and discipline. Often the rewards of study are only really appreciated after much toil. Even as an adult, can I really say that my motivation to learn things I find interesting is purely for its own sake? So often that genuine curiosity is mixed with a wish for acknowledgement of our erudition or a desire to bolster our own self-esteem through feeling learned.

Exams are a double edged sword. True, that focus on exam success over the subject matter taught for its own sake is undoubtedly harmful. We must work to limit that harm while acknowledging that exam certificates are often the very reason our students choose to study. The idea most children would learn more without exams is untested idealism and ignores lived reality.

Every September I ask my new year 12 politics students why they are studying A levels. Every year they tell me it is so they can go to university and get a good career. At the end of every year I ask them if they are pleased they now understand so much more about politics – and they are. Job done!

Is reliability becoming the enemy of validity?

What would happen if I, as a history graduate, set out to write a mark scheme for a physics GCSE question? I dropped physics after year 9 but I think it is possible I could devise some instructions to markers that would ensure they all came up with the same marks for a given answer. In other words my mark scheme could deliver a RELIABLE outcome. However, what would my enormously experienced physics teacher husband think of my mark scheme? I think he’d either die of apoplexy or from laughing so hard:

“Heather, why on earth should they automatically get 2 marks just because they mentioned the sun? You’ve allowed full marks for students using the word gravity…”

After all I haven’t a notion how to effectively discriminate between different levels of understanding of physics concepts. My mark scheme might be reliable but it would not deliver a valid judgement of the students’ understanding of physics.

A few weeks ago at ResearchEd the fantastically informed Amanda Spielman gave a talk on research Ofqual has done into the reliability of exam marking. Their research and that of Cambridge Assessment suggest marking is more RELIABLE than teachers have assumed. This might surprise teachers familiar with this every summer:

It is late August. A level results are out and the endless email discussions begin:

Hi Heather, Jake has emailed me. He doesn’t understand how he got an E on Politics A2 Unit 4 when he revised just as much as for Unit 3 (in which he got a B). I advised him to get a remark. Best, Alan

Dear Mrs F, My remark has come back unchanged and it means I’ve missed my Uni offer. I just don’t understand how I could have got an E. I worked so hard and had been getting As in my essays by the end. Would you look at my exam paper if I order it and see what you think? Thanks, Jake

Hi Alan, I’ve looked at Jake’s paper. I thought he must have fluffed an answer but all five answers have been given D/E level marks. I just don’t get it. He’s written what we taught him. Maybe the answers aren’t quite B standard – but E? Last year this unit was our best paper and this year it is a car crash. I’ll ask Clarissa if we can order her paper as she got full uniform marks. It might give some insight. Heather

Alan, I’ve looked at Clarissa’s paper. See what you think. It is a great answer. She learns like a machine and has reproduced the past mark scheme. Jake has made lots of valid points but not necessarily the ones in the mark scheme. Arguably they are key, but then again, you could just as easily argue other points are as important and how can such decent answers end up as E grade even if they don’t hit all mark scheme bullet points precisely? I just despair. How can we continue to deliver this course with conviction when we have no idea what will happen in the exam each year? Heather

I don’t like to blow my own trumpet but the surprisingly low marks on our A2 students’ politics papers were an aberration from what was a fantastic results day for our department this year:

Hi Heather, OMG those AS history results are amazing!!!! Patrick an A!!!! I worried Susie would get a C and she scored 93/100, where did that come from? Trish

I don’t tend to quibble when the results day lottery goes our way but I can admit that it is part of the same problem. Marking of subjects such as history and politics will always be less reliable than in maths and we must remember it is the overall A level score (not the swings between individual module results) that needs to be reliable. But… even so… there seems to be enormous volatility in our exam system. The following are seen in my department every year:

  1. Papers where the results have a very surprising (wrong) rank order. Weak students score high As while numerous students who have only ever written informed, insightful and intelligent prose have D grades.
  2. Students with massive swings in grades between papers (e.g. B on one and E on the other) despite both papers being taught by the same teacher and with the same general demands.
  3. Exam scripts where it is unclear to the teacher why a remark didn’t lead to a significant change in the result for a candidate.
  4. Quite noticeable differences in the degree of volatility over the years in results depending on paper, subject (history or politics in my case) and even exam board.

Cambridge Assessment have been looking into this volatility and suggested that different markers ARE coming up with similar enough marks for the same scripts – marking is reliable enough. However, the report writers then assume that all other variation must be at school or student level. There is no doubt that a multitude of school and student level factors might explain volatility in results, such as different teachers covering a course, variations in teaching focus or simply a student having a bad day. However, why was no thought given to whether lack of validity explains volatility in exam results?

For example, I have noticed a trend in our own results at GCSE and A level. The papers with quite flexible mark schemes, relying more on marker expertise, deliver more consistent outcomes closer to our own expectations of the students. It looks like attempts to make our politics A level papers more reliable have simply narrowed the range of possible responses that get rewarded, limiting the ability of the assessment to discriminate effectively between student responses. Organisations such as HMC know there is a problem but perhaps overemphasise the impact of inexperienced markers.

The mounting pressure on exam boards from schools has driven them to make their marking ever more reliable, but this actually increases unexpected grade variation and produces greater injustices as the assessment becomes worse at discriminating between candidates. The process is exacerbated by the loss of face-to-face standardisation meetings (and, in subjects such as politics, by markers unused to teaching the material), so markers are ever more dependent on, and tied to, the mark scheme in front of them to guide their decision making. If students regularly have three grades’ difference between modules, perhaps the exam boards should stop blathering on about the reliability of their systems and start thinking about the validity of their assessment.

The drive for reliability can too often be directly at the expense of validity.

It is a dangerously faulty assumption that if marking is reliable then valid inferences can be drawn from the results. We know that for some time the education establishment has been rather blasé about the validity of its assessments.

  • Apparently our country’s school children have been marching fairly uniformly up through National Curriculum levels, even though we know learning is not actually linear or uniform. It seems that whatever the levels were presumed to measure, they were not giving a valid snapshot of progress.
  • I’ve written about how history GCSE mark schemes assume a spurious progression in (non-existent) generic analytical skills.
  • Too often levels of response mark schemes are devised by individuals with little consideration of validity.
  • Dylan Wiliam points out that reliable assessment of problem solving often requires extensive rubrics which must define a ‘correct’ method of solving the problem.
  • EYFS assesses progress in characteristics such as resilience when we don’t even know if it can be taught and critical thinking and creativity when these are not constructs that can be generically assessed.

My experience at A level is just one indication of this bigger problem of inattention to validity of assessments in our education system.

Alphabet Soup – a post about A level marking.

Cambridge Assessment tell us that ‘more than half of teachers’ grade predictions are wrong’ at A level. There is an implication behind this headline that many teachers don’t know their students and perhaps have to be ‘put right’ by the cold hard reality of exam results.

What Cambridge Assessment neglect to say is that greater accuracy in predicting A level module results would actually need prophetic powers. No teacher, however good, can predict module results with any reliability because many results are VERY unpredictable. I teach history and politics, humanities subjects, and so am likely to have less consistent results than subjects like the sciences. Our history results have been relatively predictable of late but our A level politics results are another matter.

To illustrate we will now play a game. Your job is to look at a selection of real data from this year and predict each student’s results using their performance in previous modules as a guide. You might think that perhaps your lack of knowledge of each student’s ability will be a hindrance to your ability to make accurate predictions. Hmm well, we’ll see! (If you are a party pooper and don’t want to play you can scroll forward to the last table and see all the answers…)

It may help you to know that in politics the two AS and the two A2 modules follow similar formats and are of similar difficulty (perhaps Unit 2 is a little harder than Unit 1), so a teacher will probably predict the same grade for each module: there is no reason a teacher can anticipate for grades to vary markedly between them. The A2 modules are a bit harder than the AS modules, so a teacher will bear this in mind when predicting A2 results with the AS grades in front of them. (That said, students mature and frequently do better at A2 than AS, so this isn’t an entirely safe trend to anticipate.)

So using the Unit 1 grades on the first table below, what might you predict for these students’ Unit 2 module? (Remember the teacher will necessarily have given the exam board the same predicted grade for Unit 1 and 2.)

 

Candidate Unit 1 [AS] Unit 2 [AS] Unit 3 [A2] Unit 4 [A2]
1 B
2 B
3 B
4 A
5 B
6 A
7 A
8 B
9 A
10 A
11 A

(The overall results will soon be in the public domain and I have anonymised the students by not including the whole cohort (which was 19) and not listing numerical results (which interestingly would actually make prediction even harder if included). I have not listed results at retake as these would not be available when predictions were made. We use Edexcel.)

Check the table below to see if you were right.

Candidate Unit 1 [AS] Unit 2 [AS] Unit 3 [A2] Unit 4 [A2]
1 B B
2 B C
3 B E
4 A C
5 B E
6 A E
7 A D
8 B B
9 A B
10 A A
11 A A

 

Were you close? Did you get as much as 50%? If you simply predicted the same grade for Unit 2 as was scored in Unit 1 (as teachers generally would) you would have got only 7 of the full cohort of 19 correct.
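If you want to check the arithmetic on the candidates shown, a few lines of Python will do it. Note this uses only the 11 anonymised candidates above, so it cannot reproduce the full-cohort figure of 7 out of 19.

```python
# Grades for the 11 candidates shown above.
unit1 = ["B", "B", "B", "A", "B", "A", "A", "B", "A", "A", "A"]
unit2 = ["B", "C", "E", "C", "E", "E", "D", "B", "B", "A", "A"]

# The obvious strategy: predict the same grade for Unit 2 as was scored in Unit 1.
correct = sum(1 for g1, g2 in zip(unit1, unit2) if g1 == g2)
print(f"Same-grade prediction correct for {correct} of {len(unit1)} candidates shown")

# How far out each prediction was, in grade steps (A to E are consecutive letters).
steps = [abs(ord(g2) - ord(g1)) for g1, g2 in zip(unit1, unit2)]
print("Grade steps out per candidate:", steps)
```

Even on this subset the naive strategy is right for only 4 of 11 candidates, and several predictions are three or four grades adrift.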

Now don’t look at the table below until you have tried to predict the first A2 grade! Go on! Just jot down your predictions on the back of an envelope. (Isn’t this fun?)

Candidate Unit 1 [AS] Unit 2 [AS] Unit 3 [A2] Unit 4 [A2]
1 B B D
2 B C E
3 B E B
4 A C E
5 B E D
6 A E D
7 A D E
8 B B C
9 A B D
10 A A A
11 A A C

Rather an unpredictably steep drop there! It was all the more puzzling for us given that our A2 history results (same teachers and many of the same students) were great.

If you’ve got this far why not predict the final module?

 

Candidate Unit 1 [AS] Unit 2 [AS] Unit 3 [A2] Unit 4 [A2]
1 B B D C
2 B C E B
3 B E B E
4 A C E D
5 B E D B
6 A E D D
7 A D E D
8 B B C B
9 A B D D
10 A A A A
11 A A C C

 

We have no idea why there is a steep drop at A2 this year.  Perhaps our department didn’t quite ‘get’ how to prepare our students for these papers in particular. I doubt it – but if so what does this say about our exams if seasoned and successful teachers can fail to see how to prepare students for a particular exam despite much anxious poring over past papers and mark schemes? Our politics AS results were superb this year for both modules.  Our department’s History results were fantastic at AS and A2. These results were entirely UNPREDICTABLE. Or to put it another way, they were only predictable in the sense that we anticipated in advance that they would look like alphabet soup – because they often do.

The first point that should be clear is that no teacher could possibly predict these module results. To even make the statement ‘more than half of teachers’ grade predictions are wrong’ is to wilfully mislead the reader as to what the real issue is here.

In fairness, it might feel like the grades have been created by a random letter generator but the results aren’t quite random. Some very able students, such as candidate 10, get As on all papers and would only ever have had A predictions. The final grades of our students were generally within one grade of expectations so the average of the four results has some validity. This said, surely there is a more worrying headline that the TES should have run with?

Just how good are these exams at reliably discriminating between the quality of different candidates? It is argued that marking is reliable but what does this then tell us about the discriminatory strength of the exams themselves or their mark schemes? It can’t have helped that there were only 23 marks (out of 90) between an A and an E on Unit 3 or 24 marks between those grades on Unit 4. I have discussed other reasons I think may be causing the unpredictability here and here.
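A rough back-of-envelope sketch shows why such a narrow span matters. The 23-mark A-E gap on a 90-mark paper is from the figures above; the few-mark spread between markers is my own illustrative assumption, and the evenly spaced grade bands are a simplification, so treat this strictly as a toy model.

```python
import random

random.seed(2)

# Unit 3: about 23 marks out of 90 between the A and E boundaries, i.e. each grade
# band is only ~6 marks wide. Boundaries are evenly spaced here for illustration only.
GRADE_WIDTH = 23 / 4   # ~5.75 marks per grade band
MARKING_SD = 4.0       # assumed legitimate spread between competent markers

def band(mark):
    """Which grade band a mark falls into, with bands GRADE_WIDTH marks wide."""
    return int(mark // GRADE_WIDTH)

TRIALS = 100_000
changed = 0
for _ in range(TRIALS):
    true_mark = random.uniform(0, 90)                  # where the script 'really' sits
    awarded = true_mark + random.gauss(0, MARKING_SD)  # what one marker actually awards
    if band(awarded) != band(true_mark):
        changed += 1

print(f"Scripts landing in a different grade band: {changed / TRIALS:.0%}")
```

On these toy numbers a large share of scripts end up in a different grade band purely from a defensible few-mark spread between markers, and squeezing that spread out by tightening the mark scheme is exactly the move that, as argued above, narrows the range of creditworthy responses.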

Not all exams are as unpredictable as our Government and Politics A level but if we want good exams that reliably and fairly discriminate between students we need to feel confident we know why some exams have such unpredictable outcomes currently. Ofqual has been moving in that direction with their interesting studies of maths GCSE and MFL at A level. As well as being unfair in its implication the Cambridge Assessment headline is simply unhelpful as it obscures the real problems.