Performance assessments may not be ‘reliable’ or ‘valid.’ So what?
In a comment on Dan Willingham’s recent post, I said:
we have plenty of alternatives that have been offered, over and over again, to counteract our current over-reliance on – and unfounded belief in – the ‘magic’ of bubble sheet test scores. Such alternatives include portfolios, embedded assessments, essays, performance assessments, public exhibitions, greater use of formative assessments (in the sense of Black & Wiliam, not benchmark testing) instead of summative assessments, and so on. . . . We know how to do assessment better than low-level, fixed-response items. We just don’t want to pay for it…
Dan replied: I don’t think money is the problem. These alternatives are not, to my knowledge, reliable or valid, with the exception of essays.
And therein lies the problem… (with this issue in general, not with Dan in particular)
Most of us recognize that more of our students need to be doing deeper, more complex thinking work more often. But if we want students to be critical thinkers and problem solvers and effective communicators and collaborators, that cognitively complex work is usually more divergent than convergent. It is more amorphous and fuzzy and personal. It is often multi-stage and multimodal. It is not easily reduced to a number or rating or score. However, this does NOT mean that kind of work is incapable of being assessed. When a student creates something – digital or physical (or both) – we have ways of determining the quality and contribution of that product or project. When a student gives a presentation that compels others to laugh, cry, and/or take action, we have ways of identifying what made that an excellent talk. When a student makes and exhibits a work of art – or sings, plays, or composes a musical selection – or displays athletic skill – or writes a computer program – we have ways of telling whether it was done well. When a student engages in a service learning project that benefits the community, we have ways of knowing whether that work is meaningful and worthwhile. When a student presents a portfolio of work over time, we have ways of judging that. And so on…
If there is anything that we’ve learned (often to our great dismay) over the last decade, it’s that assessment is the tail that wags the instructional, curricular, and educational dogs. If we continue to insist on judging performance assessments with the ‘validity’ and ‘reliability’ criteria traditionally used by statisticians and psychometricians, we never – NEVER – will move much beyond factual recall and procedural regurgitation to achieve the kinds of higher-level student work that we need more of.
The upper ends of Bloom’s taxonomy and/or Webb’s Depth of Knowledge levels probably cannot – and likely SHOULD not – be reduced to a scaled score, effect size, or regression model without sucking the very soul out of that work. As I said in another comment on Dan’s post, “What score should we give the Mona Lisa? And what would the ‘objective’ rating criteria be?” I’m willing to confess that I am unconcerned about the lack of statistical ‘validity’ and ‘reliability’ of authentic performance assessments if we are thoughtful assessors of those activities.
How about you? Dan (or others), what are your thoughts on this?