
I recently read James Popham’s Evaluating America’s Teachers: Mission Possible? as a part of a course on designing effective educational measurements. Nerd alert: I really liked the book. It gave tremendous insight into “what’s broken” in our current evaluation system, and some ideas for how we might mend it. Here’s what I learned:
The American educational system (and in particular, the government that funds it) has become hell-bent on using student testing data to evaluate teachers. So, what's causing the furor? It appears that a perfect storm of federal legislation has precipitated a sharpened focus on teacher evaluation. The first element in the storm was the 2009 Race to the Top (RTT) initiative, and more specifically, the heavy purse tied to the improvement of teacher evaluation. If state officials and district administrators agreed to abide by the new rules for teacher evaluation, much-needed money would become available. The second element was Secretary of Education Duncan's announcement in 2011 that if state and local educational agencies implemented the more stringent six-point plan for teacher evaluation, the government would ease sanctions in place from No Child Left Behind (NCLB). Popham describes this as the carrot-and-stick approach: the carrot of RTT and the stick of NCLB.
But is the rush to comply with the new teacher evaluation rules justified? Well, to a "failing" school district, desperate to distance itself from the punitive NCLB label, or to an underfunded district clamoring for cash to keep its schools afloat, the furor is absolutely justified. On the other hand, many teachers' unions - like the California Teachers Association - are staunchly opposed to any legislation that could tie teacher compensation or employment to assessment data. For many unions, ceding that control to the federal government is not worth it - despite the money that could follow.
Many make the counter-argument that the teacher evaluation systems put in place by RTT and Duncan's six-point plan are meant to support teachers and provide professional development. Indeed, Popham presents this perspective. Yet he also notes that even the most well-intentioned system could be rendered ineffectual. Essentially, the government runs the risk of a Type I error - labeling ineffective teachers as effective (a false positive) - or a Type II error - labeling effective teachers as ineffective (a false negative). Both errors are detrimental, and both stem from four problems with the way the evaluation system currently operates. Chief among these is that districts may use poorly chosen evidence, including test data intended to measure an entirely different construct (student achievement rather than teacher performance).
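To see how the two errors differ in practice, here is a minimal sketch (mine, not Popham's) of a cutoff-based rating scheme. The effectiveness rate, score model, and cutoff are all invented for illustration; the only point is that any noisy evidence source produces both kinds of mistakes.

```python
import random

random.seed(1)

# Toy population: each teacher is truly effective or not, and receives
# a noisy evaluation score. All numbers here are invented for illustration.
teachers = []
for _ in range(1000):
    effective = random.random() > 0.3
    score = (0.8 if effective else 0.4) + random.gauss(0, 0.2)
    teachers.append((effective, score))

CUTOFF = 0.6  # hypothetical pass/fail line

type_1 = sum(1 for eff, s in teachers if s >= CUTOFF and not eff)  # bad rated good
type_2 = sum(1 for eff, s in teachers if s < CUTOFF and eff)       # good rated bad

print(f"Type I (false positives):  {type_1}")
print(f"Type II (false negatives): {type_2}")
```

Notice that moving the cutoff only trades one error for the other; no threshold eliminates both, which is part of why Popham argues the choice of evidence matters so much.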
So, where do we go awry? Administrators have a great deal to consider when deciding how to evaluate their faculty. Popham exposes the trap of giving ratings: those who rate teachers often make comparative judgments based on their own (biased) experience, and are likely to produce an inaccurate final score, either over- or under-estimating the teacher's instructional ability. Moreover, Popham suggests that administrators need to be mindful of the purpose of their evaluation: formative or summative. If an evaluator enters a classroom with both purposes in mind, the dual purpose simply muddies the waters; feedback is stronger when the administrator focuses on the formative purpose alone, since it is tough to coach a teacher well while simultaneously grading him or her. One implication here is time. To conduct high-quality teacher ratings, the administrator must have sufficient time to develop or adapt the rating tool, to receive training on how to use it, to conduct the observations (if those are part of the tool), and to meet with the teacher being rated. Unfortunately, as Popham concedes, when it comes to "carrying out [an effective] teacher appraisal strategy, you do what you can afford to do."
So, then, what can we do? Popham outlines his rationale for embracing a weighted evidence judgment approach to teacher evaluation. For administrators, his points are particularly salient, as they are often the ones tasked with evaluating teacher efficacy. His first line of argument is to dissuade administrators from trying to "human proof" an evaluation: it is not possible to remove human judgment from the process of evaluating a teacher. Decision making and discernment will be part of the process no matter what, so administrators might as well acknowledge where and how their judgments will take effect. He then enumerates the steps for creating a defensible teacher evaluation - from identifying the criteria to be evaluated, to operationalizing those criteria, to collecting evidence, to creating an aggregate score.
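To make that final aggregation step concrete, here is a minimal sketch of combining weighted evidence into a single score. The evidence sources, weights, and 0-4 scale are hypothetical placeholders, not Popham's prescriptions; in his approach, human judgment drives how each source is chosen, weighted, and scored.

```python
# Hypothetical evidence sources and weights; a real evaluation would set
# these through the judgment-driven process Popham describes.
EVIDENCE_WEIGHTS = {
    "classroom_assessments": 0.40,
    "observations":          0.30,
    "student_surveys":       0.15,
    "standardized_tests":    0.15,
}

def aggregate_score(ratings):
    """Combine per-source ratings (each on a 0-4 scale) into one weighted score."""
    assert abs(sum(EVIDENCE_WEIGHTS.values()) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(EVIDENCE_WEIGHTS[source] * rating
               for source, rating in ratings.items())

# One teacher's (invented) ratings across the four evidence sources.
ratings = {
    "classroom_assessments": 3.5,
    "observations":          3.0,
    "student_surveys":       2.5,
    "standardized_tests":    2.0,
}
print(f"Aggregate score: {aggregate_score(ratings):.2f} / 4.00")  # 2.98 / 4.00
```

The arithmetic is trivial; the defensibility lives entirely in how the weights and ratings were arrived at, which is exactly where Popham says judgment cannot be engineered away.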
In the noble quest to measure a teacher's instructional ability, Popham suggests classroom assessments as an important part of the portfolio of measurements used to evaluate educators. He raises concern over relying too heavily on standardized assessments, as they are designed as comparative measurements: if too many students have "mastered" the content behind a particular set of questions, those questions are phased out of subsequent tests. Thus, it is darn near impossible to judge students' content mastery, and by extension teachers' instructional ability, when the test is a moving (and increasingly challenging) target. Enter classroom assessments as quality-illuminating evidence for a teacher-evaluation program. Popham treats student growth as a proxy for instructional ability, and explains that pretest and posttest data for units, semesters, or entire school years could demonstrate the extent to which students have grown.
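As a rough illustration of the pretest/posttest idea, here is a minimal sketch with invented percent-correct scores. The normalized-gain metric is one common way to express growth relative to how much room a student had to improve; it is my choice for the example, not a method prescribed in the book.

```python
# Invented pretest/posttest records for one class (percent correct).
scores = [
    ("Student A", 42, 78),
    ("Student B", 55, 81),
    ("Student C", 70, 74),
]

for name, pre, post in scores:
    raw_gain = post - pre
    # Normalized gain scales growth by the room each student had to grow,
    # so a high pretest score doesn't automatically imply low growth.
    normalized_gain = raw_gain / (100 - pre)
    print(f"{name}: +{raw_gain} points (normalized gain {normalized_gain:.2f})")
```

Even this toy example shows why the metric matters: Student C's four raw points look weak next to Student A's thirty-six, but C started with far less room to grow.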
However - and this is a big however - to be sure we are measuring the teacher's influence on student growth, we have to find a way to disentangle growth that results from school instruction from growth that results from other factors. In my own teaching context, this would be a challenge. So many of my students come from families with means, and use those means for tutors, extracurricular educational experiences, immersion trips, foreign study, etc. How can I be sure that a student understands how to write an introduction because of MY instruction, as opposed to the hours of private tutoring he has received (endured?)? Popham concedes that this is a challenge, indeed. All of this may be a LOT to ask of a school administrator on top of his or her regular responsibilities. So, I return to the title: Mission Possible? Honestly, I’m not sure.