Performance assessments may not be ‘reliable’ or ‘valid.’ So what?


In a comment on Dan Willingham’s recent post, I said

we have plenty of alternatives that have been offered, over and over again, to counteract our current over-reliance on – and unfounded belief in – the ‘magic’ of bubble sheet test scores. Such alternatives include portfolios, embedded assessments, essays, performance assessments, public exhibitions, greater use of formative assessments (in the sense of Black & Wiliam, not benchmark testing) instead of summative assessments, and so on. . . . We know how to do assessment better than low-level, fixed-response items. We just don’t want to pay for it…

Dan replied

I don’t think money is the problem. These alternatives are not, to my knowledge, reliable or valid, with the exception of essays.

And therein lies the problem… (with this issue in general, not with Dan in particular)

Most of us recognize that more of our students need to be doing deeper, more complex thinking work more often. But if we want students to be critical thinkers and problem solvers and effective communicators and collaborators, that cognitively-complex work is usually more divergent rather than convergent. It is more amorphous and fuzzy and personal. It is often multi-stage and multimodal. It is not easily reduced to a number or rating or score. However, this does NOT mean that kind of work is incapable of being assessed. When a student creates something – digital or physical (or both) – we have ways of determining the quality and contribution of that product or project. When a student gives a presentation that compels others to laugh, cry, and/or take action, we have ways of identifying what made that an excellent talk. When a student makes and exhibits a work of art – or sings, plays, or composes a musical selection – or displays athletic skill – or writes a computer program – we have ways of telling whether it was done well. When a student engages in a service learning project that benefits the community, we have ways of knowing whether that work is meaningful and worthwhile. When a student presents a portfolio of work over time, we have ways of judging that. And so on…

If there is anything that we’ve learned (often to our great dismay) over the last decade, it’s that assessment is the tail that wags the instructional, curricular, and educational dogs. If we continue to insist on judging performance assessments with the ‘validity’ and ‘reliability’ criteria traditionally used by statisticians and psychometricians, we never – NEVER – will move much beyond factual recall and procedural regurgitation to achieve the kinds of higher-level student work that we need more of.

The upper ends of Bloom’s taxonomy and/or Webb’s Depth of Knowledge levels probably cannot – and likely SHOULD not – be reduced to a scaled score, effect size, or regression model without sucking the very soul out of that work. As I said in another comment on Dan’s post, “What score should we give the Mona Lisa? And what would the ‘objective’ rating criteria be?” I’m willing to confess that I am unconcerned about the lack of statistical ‘validity’ and ‘reliability’ of authentic performance assessments if we are thoughtful assessors of those activities.

How about you? Dan (or others), what are your thoughts on this?

Image credit: Meh, Ken Murphy

5 Responses to “Performance assessments may not be ‘reliable’ or ‘valid.’ So what?”

  1. Relating to Bloom’s Taxonomy, most (if not all) authentic performance assessments measure knowledge, comprehension, and application, then skip to evaluation, passing over analysis and synthesis. A truly valuable assessment measures all the areas so we can more accurately gauge the level of comprehension students have.

  2. The importance of reliability and validity depends on the purpose to which the assessment will be put. If the purpose is formative or if it’s summative and meant only for the student, then I agree it’s crazy to fret about these issues.
    But if the purpose is to gather information as to whether a school or teacher is doing a reasonable job with students, then it’s obviously pretty important. “Reliable and valid” pretty much boils down to “does the assessment mean anything, or could you just as well roll the dice?”
    So I’m disagreeing with your claim that we have ways of determining the quality of projects and other high level assessments. Too much of the quality lies in the eyes of the beholder. Example: Tony Wagner told me he will often show a classroom clip to a room full of teachers and ask them to grade the quality of instruction, A to F. He told me that the range of assigned grades is virtually always A to D.
    I resonate with the problem you’re pointing out–we can’t measure a lot of outcomes that we care about. I mentioned in my blog post that these metrics are “partial” but I didn’t emphasize this limitation. You also point out here a factor I did emphasize–that setting stakes for schools and teachers on measures that we admit are limited is a recipe for over-emphasis in schooling of whatever is tested.
    I think there are approaches that would be useful–for example, greater effort to evaluate quality of *teaching* not just outcomes (which the AFT has worked on, as well as some academics).
    For me, the starting point is to agree that there are valid reasons to want to measure student progress, and to admit that the methods we are using now have negative–sometimes very negative–unintended impacts on schools, teachers, and students.

  3. Doug Christensen said it best – these things cannot both be policy tools and pedagogy tools. And as long as we test all kids all the time, they will try to be.

    I’m in favor of a national STARS project. School-based reporting, based on the hard work of creating school-based standards, etc… with low-stakes testing (it could even be sampling if schools are large enough) for the sake of inter-rater reliability.

  4. Scott, any of the options that you suggest are as reliable and valid as the standardized tests currently being used to measure things that the tests weren’t designed to measure.

    Reliability and validity are concepts that are fluid and very dependent on the context and structure in which they live.

    The current context is a schematic for dismantling public education.

  5. Chris has covered two of the points that came to mind while reading the post. I’d add two books to the pile:

    The Testing Charade by Dan Koretz
    Beyond Testing by Deb Meier and Matt Knoester

    Regarding valid and reliable, these are terms that mean something specific to researchers and which researchers use in a specific way. Unless you are a classroom teacher with a decent command of statistics and quantitative methods, you’re not going to read those words in the same way. Perhaps more importantly, you’re not going to have the footing/confidence/background to question them.

    Finally, on the point (and the dismissal of the point) about financing and what’s cheap: we know valid and reliable practices for increasing motivation in learners, promoting lifelong reading, and fostering deeper thinking, but these practices and the capacity-building necessary to shift a school toward them are not inexpensive. No one really needs to build capacity around administering standardized tests.
