While it seems to make intuitive sense to evaluate teachers based on students’ standardized test scores (that is, using ‘value-added measures,’ or VAM), in practice it doesn’t work very well. At this time, researchers do not support the incorporation of student test scores into teacher evaluations except in carefully designed, low-stakes pilot experiments.
Extreme rating volatility
Value-added ratings are highly unstable, swinging widely from year to year and even across multi-year estimates:
- United States Department of Education, Error rates in measuring teacher and school performance using student test score gains: Value-added estimates for teacher-level analyses are subject to a considerable degree of random error when based on the amount of data that are typically used in practice for estimation. If three years of data are used for estimation, more than 1 in 4 teachers who are truly average in performance will be erroneously identified for special treatment.
- Di Carlo, The war on error: A recent analysis of VAM scores in New York City shows that the average error margin is plus or minus 30 percentile points. That puts the “true score” (which we can’t know) of a 50th percentile teacher at somewhere between the 20th and 80th percentile – an incredible 60 point spread.
- Economic Policy Institute, Problems with the use of student test scores to evaluate teachers: VAM estimates have proven to be unstable across statistical models, years, and classes that teachers teach. One study found that across five large urban districts, among teachers who were ranked in the top 20% of effectiveness in the first year, fewer than a third were in that top group the next year, and another third moved all the way down to the bottom 40%. Another found that teachers’ effectiveness ratings in one year could only predict from 4% to 16% of the variation in such ratings in the following year.
- Baker, You’ve been VAM-ified: Even in the more consistently estimated models, half or more of [New York City] teachers move into or out of the good or bad categories from year to year, between the two years that show the highest correlation in recent years. And this finding still ignores whether other factors may be at play in keeping teachers in certain categories. For example, whether teachers stay labeled as ‘good’ because they continue to work with better students or in better environments.
- Di Carlo, Reign of error: When you’re looking at the single-year teacher estimates (in this case, for 2009-10), the average spread is a pretty striking 46 percentile points in math and 62 in ELA [English-Language Arts]. Furthermore, even with five years of data, the intervals are still quite large – about 30 points in math and 48 in ELA.
- National Education Policy Center, Due diligence and the evaluation of teachers: It is likely that there are a significant number of false positives (teachers rated as effective who are really average), and false negatives (teachers rated as ineffective who are really average) in the L.A. Times’ [value-added] rating system. Only 46% of reading teachers – and only 60% of math teachers – retain the same effectiveness rating [when the model is altered to better account for students’ past performance, peer influence, and other school factors].
- ETS, Using student progress to evaluate teachers: If making causal attributions is the goal, then no statistical model, however complex, and no method of analysis, however sophisticated, can fully compensate for the lack of randomization [in schools]. Other identified problems with VAM include inappropriate attribution, missing data, inappropriate assumptions underlying VAM models, and difficulty in obtaining precise estimates of teacher effects, all of which lead to bias in the data.
- Baker, AIR pollution in New York State?: The measures are neither conceptually nor statistically accurate. They suffer significant bias … And inaccurate measures can’t be fair.
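To get a feel for what the year-to-year figures quoted above imply, here is a minimal simulation sketch. It is not taken from any of the cited studies. It assumes, purely for illustration, that a teacher's ratings in two consecutive years are standard normal scores with a chosen correlation, and it uses the 4% and 16% "variation explained" figures from the Economic Policy Institute quote as the squared correlation. The function name, the number of simulated teachers, and the random seed are all hypothetical choices.

```python
import math
import random


def quintile_persistence(r, n_teachers=20000, seed=0):
    """Simulate two years of ratings whose correlation is r.

    Returns two shares among teachers rated in the top 20% in year 1:
    (share still in the top 20% in year 2, share in the bottom 40% in year 2).
    Assumption: ratings are bivariate normal, which real VAM scores need not be.
    """
    rng = random.Random(seed)
    year1, year2 = [], []
    for _ in range(n_teachers):
        z1 = rng.gauss(0, 1)
        z2 = rng.gauss(0, 1)
        year1.append(z1)
        # Year 2 mixes the year-1 score with fresh noise so that corr = r
        year2.append(r * z1 + math.sqrt(1 - r * r) * z2)

    # Percentile cutoffs taken from the simulated ratings themselves
    top_cut_y1 = sorted(year1)[int(0.8 * n_teachers)]
    sorted_y2 = sorted(year2)
    top_cut_y2 = sorted_y2[int(0.8 * n_teachers)]
    bottom_cut_y2 = sorted_y2[int(0.4 * n_teachers)]

    top_in_y1 = [i for i in range(n_teachers) if year1[i] >= top_cut_y1]
    stayed = sum(year2[i] >= top_cut_y2 for i in top_in_y1) / len(top_in_y1)
    fell = sum(year2[i] < bottom_cut_y2 for i in top_in_y1) / len(top_in_y1)
    return stayed, fell


# 4% and 16% of variation explained correspond to r = 0.2 and r = 0.4
for r_squared in (0.04, 0.16):
    stayed, fell = quintile_persistence(math.sqrt(r_squared))
    print(f"R^2 = {r_squared:.2f}: {stayed:.0%} stayed in the top 20%, "
          f"{fell:.0%} fell to the bottom 40%")
```

If ratings were pure noise (r = 0), about 20% of top-rated teachers would stay in the top group by chance alone, so the useful comparison is how far above that baseline the simulated shares land.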
Educational research and policy institutions do not support the use of VAM for teacher evaluation
Because the ratings are so unstable, those organizations that actually look at the peer-reviewed research – and not just ideologically-driven policy advocacy papers – have come out strongly against the use of student test scores for teacher evaluation. These include many of our most respected assessment experts (such as James Popham, Gerald Bracey, and Robert Linn) and educational policy and research institutions such as the National Research Council, the American Educational Research Association, the National Academy of Education, and RAND:
- American Statistical Association, ASA statement on using value-added models for educational assessment: Most VAM studies find that teachers account for about 1% to 14% of the variability in student test scores. VAM scores have large standard errors, even when calculated using several years of data. These large standard errors make rankings unstable, even under the best modeling scenarios. Multiple years of data do not help problems caused when models systematically undervalue teachers who work in specific contexts or with specific types of students.
- ETS, Reliability and validity of inferences about teachers based on student test scores: Teacher VAM scores should emphatically not be included as a substantial factor with a fixed weight in consequential teacher personnel decisions. Scores may be systematically biased for some teachers and against others.
- National Research Council, Value-added methods to assess teachers not ready for use in high-stakes decisions: Too little research has been done on these methods’ validity to base high-stakes decisions about teachers on them. VAM estimates of teacher effectiveness should not be used to make operational decisions because such estimates are far too unstable to be considered fair or reliable.
- American Educational Research Association & National Academy of Education, Getting teacher evaluation right: Value-added models of teacher effectiveness are highly unstable. Teachers’ value-added ratings are significantly affected by differences in the students who are assigned to them, even when models try to control for prior achievement and student demographics. Value-added ratings cannot disentangle the many influences on student progress. Other [teacher evaluation] tools have been found to be more stable. [Using VAM for] high-stakes, individual-level decisions, as well as comparisons across highly dissimilar schools or student populations, should be avoided.
- RAND, Evaluating value-added models for teacher accountability: The research base is currently insufficient for us to recommend the use of VAM for high-stakes decisions. In particular, the likely biases from the factors we discussed … are unknown, and there are no existing methods to account for either the bias or the uncertainty that the possibility of bias presents for estimates. Furthermore, the variability due to sampling error of individual teacher-effect estimates depends on a number of factors — including class sizes and the number of years of test-score data available for each teacher — and is likely to be relatively large. Similarly, rankings of teachers should be avoided because of lack of stability of estimated rankings.
- Annenberg Institute for School Reform, Brown University, Can teachers be evaluated by their students’ test scores?: In the abstract, value-added assessment of teacher effectiveness has great potential to improve instruction and, ultimately, student achievement. The notion that a statistical model might be able to isolate each teacher’s unique contribution to their students’ educational outcomes – and by extension, their life chances – is a powerful one. However, the promise that value-added systems can provide such a precise, meaningful, and comprehensive picture is not supported by the data. Annual value-added estimates are highly variable from year to year, and, in practice, many teachers cannot be statistically distinguished from the majority of their peers. Persistently exceptional or failing teachers – say, those in the top or bottom 5 percent – may be successfully identified through value-added scores, but it seems unlikely that school leaders would not already be aware of these teachers’ persistent successes or failures.
- Popham, Teacher evaluation pitfalls: Despite the current clamor to evaluate teachers’ effectiveness on the basis of their students’ test scores, no evidence currently exists to show that the tests intended for use in such evaluations are up to the job. Put simply, there is no proof – none at all – that these tests can accurately distinguish between well-taught and badly taught students.
- National Education Policy Center, Review of two culminating reports from the MET project: Randomization was significantly compromised, and participating teachers were not representative of teachers as a whole. [Regarding] how best to combine value-added scores, classroom observations, and student surveys in teacher evaluations, the data do not support the MET project’s premise that all three primarily reflect a single general teaching factor, nor do the data support the project’s conclusion that the three should be given roughly equal weight. . . . Evaluating teachers requires judgments . . . that are not much informed by the MET’s masses of data. While the MET project has brought unprecedented vigor to teacher evaluation research, its results . . . offer little guidance about how to design real-world teacher evaluation systems.
- Bracey, What’s the value of growth measures?: [VAM] cannot permit causal inferences about individual teachers. At best, it is a first step toward identifying teachers who might need additional professional development or low performing schools in need of technical assistance.
Predictable, harmful results from the use of VAM
Despite researchers’ and statisticians’ strong recommendations against doing so, some states have forged ahead with ‘value-added’ teacher evaluation systems anyway. Ignoring numerous warnings to the contrary has resulted in predictable, harmful outcomes. For example:
- Washington Post, A ‘value-added’ travesty for an award-winning teacher: Teacher of Year rated unsatisfactory under VAM system.
- Orlando Sentinel, Teacher evaluation process: unsatisfactory: One of nation’s top high schools, with highest FCAT scores in high-achieving Seminole County, rated as ‘needs improvement’ under state’s teacher evaluation system.
- Amrein-Beardsley, et al., Value-added model research for educational policy: Policymakers throughout the country are increasingly embedding score-based (VAM) approaches within educational evaluation and accountability systems. On the other hand, social science researchers are increasingly questioning the methodological, technical, and inferential attributes of these same VAM approaches. . . . Policymakers have come to accept VAM as an objective, reliable, and valid measure of teacher quality. At the same time, [they ignore] the technical and methodological issues.
Policymakers should also note that, as long as the measures remain this unstable, the legality of VAM is very much in question:
- Baker, Oluwole, & Green, The legal consequences of mandating high-stakes decisions based on low quality information: Student growth percentile measures being adopted by states for use in teacher evaluation are, on their face, invalid for this particular purpose. . . . [and] are likely to open the floodgates to new litigation over teacher due process rights. This is likely despite the fact that much of the policy impetus behind these new evaluation systems is the reduction of legal hassles involved in terminating ineffective teachers.
- Pullin, Legal issues in the use of student test scores and value-added models to determine educational quality: If VAM is used for high-stakes consequences like salary differentiation, termination, or damage to professional reputation, the potential for successful legal challenge is high. Given the scientific issues associated with VAM methodologies, it is possible that the use of VAM to make a high-stakes decision about an educator would not even survive a rational basis review under Equal Protection analysis.
- NPR, Teachers union files federal lawsuit challenging Florida teacher evaluations: Current teacher evaluation system violates equal protection and due process rights of teachers.
Using VAM as one of ‘multiple measures’
Some VAM advocates (such as StudentsFirst and the Gates Foundation) have proposed using student test scores as just one of ‘multiple measures’ to evaluate teachers, along with student surveys, administrator observations, professional portfolios, and other factors. Unfortunately, the instability of the test score component still means that a significant percentage of teachers’ evaluations is highly volatile. Do we ask doctors, lawyers, and other professionals to adopt systems in which a large percentage of their evaluation is based on a component that has been shown repeatedly by researchers to be statistically invalid, operationally unreliable, and disproportionately impactful? It’s like asking them to eat an ice cream sundae with two scoops of ice cream and one scoop of horse droppings. Even though it’s only one part of many, we’re still asking them to eat manure…
Another issue worth noting is that even if teacher effects could be teased out, decades of peer-reviewed research show that teachers account for only about 10% of overall student achievement (give or take a few percentage points). Another 10% or so is attributable to other school factors such as leadership, resources, and peer influences. The remaining 80% of overall student achievement is attributable to non-school factors such as individual, family, and neighborhood characteristics. A few exceptional ‘beating the odds’ schools aside, these ratios have remained fairly stable (i.e., within a few percentage points) since they were first noted by the famous Coleman Report of the 1960s. Given the overwhelming percentage of student learning outcomes that is attributable to non-teacher factors, it is neither ethical nor legally defensible to base teacher evaluations on factors outside teachers’ control.
Using VAM as a screening measure
Right now, the best way to use VAM appears to be as a screening mechanism, much as in medicine. Screening procedures used by doctors often have high error rates, so they are used simply to identify patients who warrant further investigation. As Douglas Harris, endowed chair at Tulane University and author of Value-Added Measures in Education, explains:
[In medicine,] those who are positive on the screening test are given another “gold standard” test that is more expensive but almost perfectly accurate. They do not average the screening test together with the gold standard test to create a combined index. Instead, the two pieces are considered in sequence.
Ineffective teachers could be identified the same way.
Value-added measures could become the educational equivalent of screening tests. They are generally inexpensive and somewhat inaccurate. As in medicine, a value-added score, combined with some additional information, should lead us to engage in additional classroom observations to identify truly low-performing teachers and to provide feedback to help those teachers improve. If all else fails, within a reasonable amount of time, after continued observation, administrators could counsel the teacher out or pursue a formal dismissal procedure.
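The logic of screening first and confirming second can be made concrete with a little Bayes' rule arithmetic. The sketch below is only an illustration: every number in it (the share of truly low-performing teachers, the hit rate and false-alarm rate of the value-added flag, and the accuracy of follow-up observation) is a made-up assumption, not a figure from Harris or from any study cited here. It also assumes the two measures err independently of each other once a teacher's true performance is fixed.

```python
def posterior_after_positive(prior, sensitivity, false_positive_rate):
    """Bayes' rule: chance a flagged teacher is truly low-performing.

    prior               -- share of teachers who are truly low-performing
    sensitivity         -- share of truly low-performing teachers who get flagged
    false_positive_rate -- share of all other teachers who get flagged anyway
    """
    true_flags = sensitivity * prior
    false_flags = false_positive_rate * (1 - prior)
    return true_flags / (true_flags + false_flags)


# Hypothetical inputs, chosen only to illustrate the arithmetic
base_rate = 0.10  # assume 1 in 10 teachers is truly low-performing

# Stage 1: a cheap, noisy value-added flag
after_vam = posterior_after_positive(
    base_rate, sensitivity=0.75, false_positive_rate=0.25
)

# Stage 2: careful follow-up observation of flagged teachers only,
# assumed here to be far more accurate than the screening step
after_observation = posterior_after_positive(
    after_vam, sensitivity=0.95, false_positive_rate=0.05
)

print(f"After the value-added flag alone: {after_vam:.2f}")
print(f"After a confirming observation:   {after_observation:.2f}")
```

With these invented inputs, only a quarter of the teachers flagged by the screening step are truly low-performing, which is why a flag on its own is a reason to look more closely rather than a verdict. Considering the two steps in sequence, instead of averaging them into one index, is what lets the more accurate second step do the deciding.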
Legislation or policies that advocate for the inclusion of student test scores as part of teacher evaluation will have to somehow overcome the significant limitations outlined above in order to be both ethically and legally defensible. In particular, the rating volatility that results in large percentages of teachers bouncing from year to year between excellent, average, and unsatisfactory categories must be drastically reduced. Standardized test scores that purport to be fair, objective, valid, and reliable for student learning purposes appear to be much less so when it comes to evaluating teachers’ contributions to that learning. The fact that these technical, methodological, statistical, and implementation challenges still loom large after nearly two decades of work underscores the difficulty of the task. At this point, ‘value-added’ teacher evaluation is an idea that makes sense in theory but remains unworkable in practice. As such, no state should be incorporating student test scores into teacher evaluations in anything other than carefully designed, low-stakes pilot experiments.
Other resources that may be helpful
- Petrilli, All or nothing on teacher accountability (teacher improvement v. teacher accountability)
- Amrein-Beardsley, Why VAM is a sham (top 10 reasons VAM doesn’t work)
- New York Times, Confessions of a ‘bad’ teacher
- Baker, On misrepresenting (Gates) MET to advance state policy agendas
- Amrein-Beardsley, Methodological concerns about [Tennessee’s] education value-added assessment system
- Baker, The toxic trifecta, bad measurement, and evolving teacher evaluation policies
- Baker, Gates still doesn’t get it! Trapped in a world of circular reasoning and flawed frameworks