High-leverage teacher evaluation practices for instructional improvement

This study's purpose is to extend our understanding of school leadership for student learning by identifying high-leverage teacher evaluation practices that improve teaching. A partnership with a state education agency administered a teacher questionnaire regarding evaluation practices multiple times in one semester, then linked teacher responses to their next within-semester observation score. Broadly, teachers reported on scoring practices, the facilitation of post-observation conferences, feedback characteristics, and post-conference supports for evaluation-informed professional learning. Fixed effect regressions effectively compare observation scores and teacher-reported evaluation practices within the same teacher or teacher-by-evaluator pairing over four months while controlling for month-to-month influences on performance. The methods remove several serious confounders plausibly affecting related estimates in prior work. The analysis identifies six high-leverage teacher-reported evaluation practices, most of which apply to post-conference practices linking evaluation to professional learning. The evidence refines the academic understanding of leadership for student learning and implies that leadership preparation and in-service programs might emphasize the six high-leverage evaluation practices to promote active use among practicing leaders. Policymakers might ensure that aspiring and in-service leaders can develop these practices and that there are strong links between teacher evaluation and professional learning systems for school leaders to use.


Introduction
Over the last 15 years, education agencies and ministries worldwide reformed teacher evaluation significantly (Hunter, 2021; Lavigne, 2018). Despite differences across evaluation systems, many revised teacher tenure policies, adopted new standards-based performance rubrics and multiple teacher performance measures, increased the frequency of classroom observations and observation conferences, and some purposefully linked evaluation and professional development systems (Donaldson, 2021). Ultimately, these reforms were meant to improve student outcomes via evaluation for accountability, primarily through teacher dismissal and recruitment, and via development or supervision, which aimed to strengthen teachers' instructional practices.1 Aside from the rare instances of automatically triggered policy responses (e.g. consistently low performance by teachers in Washington DC's IMPACT system resulted in automatic dismissal; Dee and Wyckoff, 2015), the implementation of nearly all evaluation reforms fell on the shoulders of assistant and head principals (Hunter and Rodriguez, 2021; Kraft and Gilmour, 2016). Indeed, these reforms significantly changed how school leaders spent their time and implemented instructional leadership practices (Hunter and Rodriguez, 2021; Kraft and Gilmour, 2016).
Notwithstanding these substantial changes to school leadership, the field lacks rigorously tested, evidence-based "high-leverage" teacher evaluation practices: a set of specific leadership practices that significantly improve teaching and student outcomes. While practitioners, theory, and descriptive research implicitly point to some potentially high-leverage practices, little, if any, rigorous research links specific practices to teaching or student outcomes. For example, practitioners ostensibly believe that frequent observations, accurate performance ratings, and effective post-observation feedback may be high-leverage practices, as evaluator training (Kraft and Gilmour, 2016) emphasizes such skills (Hunter, 2020; Kraft and Christian, 2022; Steinberg and Sartain, 2015). Theories suggest that evaluators should praise and encourage teachers, including those who need to improve, involve teachers in making sense of performance measures and designing teacher improvement plans, and encourage teacher self-reflection (Donaldson, 2021).
A growing body of empirical work documents several school leadership teacher evaluation practices; moreover, some quantitative work links these practices to performance measures, though most, if not all, of this work was not conducted within recently reformed evaluation contexts or did not control for plausibly significant confounders.2 Recent qualitative work reveals that recently implemented school leadership evaluation practices include goal-setting (Atteberry and LaCour, 2021), classroom observations, providing relevant professional learning opportunities (Ovando and Ramirez, 2007), and resolving conflicts during contentious conferences (Conrad and Hackmann, 2020). Hattie and Timperley's (2016) influential review of teacher-provided feedback to students concludes that student learning improves when the received feedback identifies goals and improvement strategies, implying that the evaluation practice of providing similar feedback might improve teacher performance. Related work in teacher preparation settings offers similar implications and suggests that the timing of provided feedback affects teacher-candidate performance (Scheeler et al., 2004). Finally, the works most similar to the current study link specific types of feedback to teacher-reported feedback use qualitatively (Kimball, 2003; Tuytens and Devos, 2016) and quantitatively (Devos et al., 2014; Tuytens and Devos, 2011). Additionally, Grissom and colleagues (2018) linked school leader observation scores, similar to the teacher observation scores used by the analysis herein, to teacher and student performance measures. However, the leader observation scores in Grissom et al. (2018) were generic and could not differentiate between leadership practices for teacher evaluation and any other observed leadership practices.
The current study extends our understanding of school leadership evaluation practices that improve teaching. It does so by estimating the relationships between teacher-reported evaluation practices (TREPs) and formal teacher performance measures in a current evaluation context while controlling for plausibly significant confounders. Notably, the study uses unique teacher questionnaire and performance data collected by a researcher-practitioner partnership between the Tennessee Department of Education (TDOE; i.e. practitioners) and the Tennessee Education Research Alliance (i.e. researchers). TDOE was interested in exploring some specific, potentially high-leverage TREPs school leaders might implement between a classroom observation and observation-informed teacher improvement. TREPs and teacher performance measures were collected multiple times within a single semester, effectively controlling for plausibly significant confounders that vary over longer periods. The study answers the following research questions:
1. To what extent is teacher performance associated with specific teacher-reported school leadership evaluation practices?
2. To what extent are the estimated associations sensitive to differences between specific teacher-by-evaluator pairings and differences over time?
The study makes two significant contributions to research, policy, and practice. First, it extends our understanding of Hallinger's (2011) leadership for learning theory, which asserts that school leadership practices affect teaching and, ultimately, student learning. Second, it suggests which evaluation practices leadership preparation and in-service programs might emphasize as high-leverage in improving teaching.

Conceptual framework
Hallinger's (2011) leadership for learning model asserts that school leader, teacher, and contextual characteristics affect school leadership practices, which in turn affect student learning. Though conceived broadly, leadership for learning may apply to teacher evaluation practices, as most recent teacher evaluation reforms have fallen on school leaders' shoulders (Hunter and Rodriguez, 2021; Kraft and Gilmour, 2016). Indeed, in the current study's context, 75% of evaluators are assistant or head principals. Furthermore, a growing body of evidence is consistent with Hallinger's model, as recent work concludes that evaluator (i.e. school leader), teacher, and contextual characteristics affect evaluation practices (Hunter and Ege, 2021; Hunter and Rodriguez, 2021; Hunter and Springer, 2022). A smaller literature implies that teacher evaluation as implemented by school leaders can improve student learning under certain conditions via improvements in teaching as measured by observation scores (Hunter and Bowser, 2021; Phipps and Wiseman, 2021; Steinberg and Sartain, 2015; Taylor and Tyler, 2012). Finally, emerging work concludes that teacher observation scores capture meaningful information about teacher effectiveness and subsequent student outcomes; students taught by teachers with higher observation scores, indeed, higher scores from the same context as the study herein, experience better short- and long-term academic and non-academic outcomes (Doan, 2019). Ultimately, the leadership for learning model and prior empirical work suggest that teacher and evaluator characteristics affect the leadership practice of teacher evaluation and that evaluation practices improving teacher observation scores may improve student academic and non-academic outcomes.

Related work
Although teacher evaluation is more encompassing than teacher observation (e.g. the former includes personnel decision-making), this review considers studies applying to the broader topic of evaluation. The review also examines work beyond K-12 settings, as the fields of psychology and management have far more established theoretical and empirical bodies of work than research limited to K-12 settings. By reviewing research from other settings, this study does not assume that relationships in those settings necessarily apply to K-12 settings. Instead, such theoretical and empirical precedents help frame and interpret the current analysis.
The literature discusses various theoretically effective evaluation practices that school leaders might implement, far more than examined by the current study. This study focuses on the theoretically high-leverage evaluation practices school leaders might implement that were also of interest to TDOE. TDOE and prior work suggested that the practices evaluators implement from the beginning of a classroom observation through managing observation-informed teacher improvement efforts are the crux of evaluation for development (Donaldson, 2021). More specifically, the study examines practices associated with (a) conducting observations, (b) teacher engagement during post-observation conferences, (c) providing performance-enhancing feedback, and (d) facilitating post-conference teacher development. The researcher-practitioner partnership developed a questionnaire asking teachers about foci (a)-(d) (see below for details on questionnaire design and items).
Prior work implies that teachers who believe their observation scores are inaccurate or unfair are less likely to act on observation-informed improvement plans, stunting development (Jawahar, 2007, 2010). The current study asked teachers if they are observed frequently enough for their observers to score them accurately (Frequency) and if their observation scores reflect their performance (ScoresRefl; see Table 1 for the items). Predominantly theoretical work argues that practices encouraging teacher engagement during post-observation conferences also promote development (Donaldson, 2021; Glickman, 2002). Thus, the study collected teacher reports about the extent to which they felt comfortable sharing mistakes with their observer during post-observation conferences (ShareMistakes) and the extent to which their observer encouraged them to reflect on their practice (Reflection; see Table 1 for questionnaire items).
A vast corpus examines the characteristics of performance-enhancing feedback, though little, if any, of this work attempts to remove concerning sources of bias, hampering causal inference. Prior studies conclude that feedback that is more specific (Hill and Grossman, 2013), references evidence from the observation, and is perceived as useful by recipients is more likely to improve performance and student outcomes (Donaldson, 2021; Kinicki et al., 2004; Tuytens and Devos, 2016). However, one recent study using researcher-coded written feedback provided to teachers finds that receiving higher dosages of evidence-referencing feedback is not associated with teacher improvement (Hunter and Springer, 2022). The study herein asked teachers about the degree of feedback specificity (Specificity), the extent to which evaluators referenced evidence during the post-observation conference (Evidence) and student work specifically (StdtWork), and the usefulness of feedback for improving instruction (ImpInst) and student learning (ImpLrning; see Table 2 for items).
Additionally, the questionnaire asked teachers about the length of the post-observation conference (PostLength; see Table 2 for the item). Although I am unaware of any study linking post-observation conference length to subsequent performance, conferences must be long enough for evaluators and teachers to hold productive conversations. Furthermore, prior work implicitly assumes that conference length is vital, as it criticizes evaluators who systematically hold brief post-observation conferences (Hunter and Rodriguez, 2021; Kraft and Gilmour, 2016). Finally, teachers reported on the extent to which they had access to post-conference resources for development. Specifically, the questionnaire asked teachers if they had access to someone in their school with expertise who could help them improve based on the post-observation feedback (AccExp) and had time to meet with that person (TimeExp). Teachers were also asked if they had access to relevant professional development opportunities (AccPD) and had time to plan how to act on their improvement plans (TimePlan). Although I am unaware of empirical studies linking evaluation-informed professional learning resources to performance, practitioner and academic works argue that such links are essential (Archer et al., 2016; Donaldson, 2021; Weisberg et al., 2009; see Table 1 for items).

Context: Tennessee teacher evaluation system
Tennessee districts adopt the default TDOE-designed teacher evaluation system, called the Tennessee Educator Acceleration Model (TEAM), or design their own. All participating districts adopted the TEAM system.

Observation rubric
Evaluators use the TEAM rubric, which is based on Danielson's ubiquitous Framework for Teaching, to assess teacher performance. The TEAM rubric is used to assess student and teacher behaviors and the interactions between them (see Appendix A). The rubric includes 19 performance indicators (e.g. Lesson Structure and Pacing, Questioning) used for classroom observation and four more indicators regarding teacher professionalism, which describe teacher behaviors beyond the classroom. As the current study focuses on evaluator assessments of teacher classroom performance, it omits the four professionalism indicators. TEAM scores range from 1 = Below Expectations to 5 = Above Expectations. Analyses find that higher performance on the TEAM rubric is associated with higher student achievement scores and better non-academic outcomes (Daley and Kim, 2010; Doan, 2019).

Evaluator preparation
TDOE facilitates two-day evaluator preparation academies each summer, focusing on accurate scoring, facilitation of observation conferences, teacher development, and evaluation-related policies. Prospective evaluators must pass a certification exam following the academy before officially observing or evaluating teachers (Alexander, 2016). Certification exams assess scoring accuracy via video-recorded lessons and prospective evaluators' knowledge of evaluation implementation and Tennessee policy. Returning evaluators may test out of the two-day training if they receive a high enough score on the annual certification exam.

Teacher performance scoring procedures
The Tennessee Board of Education (TBOE) assigns teachers a number of observations based on their prior-year composite effectiveness score and experience. TEAM observation scores primarily determine teacher composite level of effectiveness (LOE) scores, followed by student outcomes (e.g. student achievement, department- or school-wide student achievement). TBOE assigns teachers with the lowest LOE score four observations and teachers with the highest LOE score one. Moderately effective teachers are assigned two or four observations depending on their years of experience (for details see Hunter, 2020). Districts may add to these assignments but cannot assign fewer. Teachers report that observations last approximately 30 min each (Hunter, 2020).
Within one week of each observation, TBOE expects teachers to engage in a structured feedback conference with their evaluator.The evaluator and teacher are to discuss opportunities for improvement based on the observation and actionable improvement strategies.Teachers needing improvement suggestions or support beyond the feedback provided during the post-observation conference should be pointed toward suitable professional learning opportunities.Theoretically, feedback conferences end with a recap of improvement goals and timelines for improvement.

Data
This researcher-practitioner partnership uses regularly collected administrative data and unique questionnaire data from five rural Tennessee districts. TDOE recruited participating districts based on four broad TDOE-identified evaluation implementation foci. The first concerned school work environments and teacher and school administrator perceptions of teacher evaluation implementation, as reported on nine questionnaire items from the statewide-administered 2017-2018 Tennessee Educator Survey (see Appendix B). District-level average teacher 2017-2018 value-added (i.e. TVAAS) and LOE scores, two de facto Tennessee teacher performance measures, informed the second broadly conceived recruitment criterion. Third, TDOE recruited districts based on "misaligned" TEAM and TVAAS scores. During the study period, Tennessee policy asserted that teachers' TEAM and TVAAS scores should align; both range from 1 to 5. TBOE defined alignment in terms of the absolute value of the difference between the two scores (Alexander, 2016). Finally, TDOE was concerned about "non-differentiated" observation scores. Per TDOE, evaluators were non-differentiating if 90% or more of their issued scores were assigned to the same performance level (e.g. 3). TDOE effectively assumed that performance varies within and between teachers and that observation scores should reflect such variation. Although the four recruitment criteria are somewhat technical, TDOE did not apply them formulaically or rigidly. Instead, TDOE reviewed the four characteristics informally, then recruited five districts, all of which joined the study.
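The last two recruitment screens lend themselves to a simple computation. A minimal sketch of both, using invented data and hypothetical column names (the actual TDOE review was informal, per the text):

```python
# Illustrative sketch of two TDOE recruitment screens; all data and
# column names below are invented for the example.
import pandas as pd

scores = pd.DataFrame({
    "evaluator": ["E1"] * 10 + ["E2"] * 10,
    # E1 issues the same level almost every time; E2 differentiates.
    "level": [3] * 10 + [1, 2, 2, 3, 3, 3, 4, 4, 5, 5],
})

# "Non-differentiated": 90%+ of an evaluator's scores at one level.
def non_differentiating(levels, threshold=0.90):
    return levels.value_counts(normalize=True).max() >= threshold

flags = scores.groupby("evaluator")["level"].apply(non_differentiating)

# "Misalignment": absolute gap between TEAM and TVAAS (both 1-5).
teachers = pd.DataFrame({"team": [4, 2, 5], "tvaas": [4, 5, 1]})
teachers["misalignment"] = (teachers["team"] - teachers["tvaas"]).abs()
```

Aggregating such flags and gaps to the district level would yield the kind of summary TDOE reviewed before recruiting.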

Questionnaire design and administrative data
The researcher-practitioner partnership designed a unique questionnaire to capture relatively short-term variation in TREPs. The partnership developed 14 items, with most using a six-point Strongly Disagree to Strongly Agree scale (Tables 1 and 2). However, Specificity response options ranged from Generic = 1 to Lesson-Specific = 5, StdtWork responses were Yes = 1 or No = 0, and conference length options included ranges of minutes (e.g. 5-10 min; Table 2).
Although prior studies have used similar items (e.g. Cherasaro et al., 2016; Jawahar, 2010), the current study's questionnaire design was unique in that teachers were invited to respond to multiple questionnaires within a single semester. At the end of each month, from February 2018 through April 2018, TDOE emailed teachers in participating districts to ask if they had received a formal observation within the prior month. If so, the questionnaire invited teachers to continue answering questions concerning their recent observations. The design aimed to collect a teacher's first questionnaire submission after her first Spring 2018 observation and, more generally, her kth submission after her kth Spring 2018 observation. Thus, one teacher might have submitted her first questionnaire in April after receiving her first and only Spring 2018 observation in March, while a second teacher might have submitted her second questionnaire in April after receiving her third Spring 2018 observation in early April.
Questionnaire data are linked with TDOE administrative data, including teacher race, gender, experience, education level, and prior-year LOE scores.The administrative data are also unique in that they include observation-level dates and teacher performance scores as measured by the TEAM rubric.Thus, a teacher's kth questionnaire submission can be linked to her next observation score.For example, suppose a teacher received an observation in January and submitted her first questionnaire that same month.Then, in March, she received another observation and submitted her second questionnaire at the end of the month.In this case, the teacher's January evaluation implementation report is linked to her next observation score from March.
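The submission-to-next-observation link described above resembles a forward "as-of" merge. A hedged sketch with invented teachers, dates, and column names:

```python
# Link each questionnaire submission to the teacher's NEXT observation.
# All teachers, dates, and column names are invented for illustration.
import pandas as pd

submissions = pd.DataFrame({
    "teacher": ["A", "A", "B"],
    "submitted": pd.to_datetime(["2018-02-28", "2018-03-31", "2018-02-28"]),
})
observations = pd.DataFrame({
    "teacher": ["A", "A", "A", "B"],
    "observed": pd.to_datetime(
        ["2018-02-15", "2018-03-20", "2018-04-10", "2018-02-10"]),
    "team_score": [3.4, 3.8, 4.0, 3.1],
})

# direction="forward" picks the first observation on or after the
# submission date, matched within teacher.
linked = pd.merge_asof(
    submissions.sort_values("submitted"),
    observations.sort_values("observed"),
    left_on="submitted", right_on="observed",
    by="teacher", direction="forward",
)
```

Teacher B's submission finds no later observation and gets a missing score, illustrating the kind of record that cannot enter the analysis.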

Methods
The methodological goal is to estimate associations between teacher-reported evaluation implementation practices (TREPs) and teacher performance while controlling for significant sources of bias. Estimates are biased if an omitted variable significantly correlates with both a TREP and observation scores; omitted variables that correlate with only one of the two do not introduce bias. As the data are non-experimental, the regressions may not recover genuine causal estimates. Nonetheless, the unique questionnaire design and administrative data permit quasi-experimental methods that plausibly remove significant sources of bias; thus, the estimates might approximate causal effects.
In terms of internal validity, or the extent to which the analytical methods warrant causal inferences, the most important research design element is that all data were collected within four months of a single semester. If this study had collected data annually and then regressed observation scores on a TREP, year-to-year differences in classroom assignments, teacher or evaluator ability, or other year-to-year factors might introduce bias. A research design could overcome year-to-year threats by collecting data semesterly, which would represent a substantial methodological improvement. However, such a design remains susceptible to semesterly confounders, such as new class assignments for high school teachers. The current study overcomes these threats because, by virtue of the data collection design, it uses only within-semester variation.
Although the data collection plausibly removes significant time-varying confounding variation, it does not control for confounding differences between teachers or evaluators. As implied by the leadership for learning model and prior empirical work, unobserved teacher characteristics affect evaluation practices and teacher performance (Hallinger, 2011; Hunter and Ege, 2021). Thus, the analysis applies teacher fixed effects via equation (1), effectively comparing TREPs and observation scores within the same teacher over 4 months:

y_ik = δ TREP_ik + θ_i + e_ik (1)
where y_ik is the ith teacher's next observation score after her kth questionnaire submission, TREP_ik is one of the 14 TREPs, θ_i is the teacher fixed effect, and e_ik is the error term; standard errors are clustered at the teacher level.3 A separate model is estimated for each TREP, and each coefficient δ represents the change in y_ik associated with a one-unit increase on the TREP item scale, where a unit increase reflects better-implemented evaluation practices as reported by teachers. Equation (1) cannot include measures of teacher characteristics (e.g. race/ethnicity, prior-year LOE) collected annually because such measures do not vary within teachers over four months. Moreover, θ_i controls for such variables and for all unobserved differences between teachers that do not vary within four months.
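Because θ_i is a teacher fixed effect, equation (1) can be estimated by demeaning within teacher (the Frisch-Waugh-Lovell within estimator). A minimal sketch on simulated data; the teacher count, scales, and true δ = 0.05 are invented, and the paper's teacher-clustered standard errors are omitted for brevity:

```python
# Within-teacher demeaning absorbs theta_i, so delta is identified only
# by within-teacher variation. All data here are simulated.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_teachers = 60                       # two linked submissions each
df = pd.DataFrame({
    "teacher": np.repeat(np.arange(n_teachers), 2),
    "trep": rng.integers(1, 7, size=2 * n_teachers),  # 6-point scale
})
theta = np.repeat(rng.normal(3.8, 0.4, n_teachers), 2)  # teacher FE
df["score"] = theta + 0.05 * df["trep"] + rng.normal(0, 0.2, 2 * n_teachers)

# Demean outcome and TREP within teacher, then regress (FWL theorem).
trep_dm = df["trep"] - df.groupby("teacher")["trep"].transform("mean")
score_dm = df["score"] - df.groupby("teacher")["score"].transform("mean")
delta = (trep_dm * score_dm).sum() / (trep_dm ** 2).sum()
```

This yields the same δ as OLS with a full set of teacher dummies, which is how equation (1) is usually written.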
Although the analytical sample includes teachers who submitted up to two questionnaires, δ is effectively estimated using data from the teachers with exactly two questionnaire submissions who received an observation after each submission.4 Teachers with exactly one submission have no within-teacher variation in any TREP to contribute to δ. Similarly, among teachers who submitted two questionnaires but received a total of one within-semester observation, there is no within-teacher variation in observation scores with which to estimate δ. While these "singleton" teachers do not contribute to δ directly, they effectively increase statistical power, implying that any undetected statistical relationships are not due to low power.
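The identification logic above is mechanical and easy to verify: only teachers with two linked records, and with within-teacher variation on the item, move δ. A small invented illustration:

```python
# Which teachers identify delta? All records below are invented.
import pandas as pd

linked = pd.DataFrame({
    "teacher":    ["A", "A", "B", "C", "C"],
    "trep":       [4, 5, 3, 4, 4],
    "next_score": [3.2, 3.6, 4.0, 3.5, 3.5],
})

# Teachers need two linked submissions to have any within variation.
counts = linked.groupby("teacher").size()
contributors = counts[counts >= 2].index          # A and C qualify

# Teacher C has two submissions but identical TREP responses, so she
# still contributes no within-teacher TREP variation for this item.
varies = linked.groupby("teacher")["trep"].nunique() > 1
```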
The research design and equation (1) plausibly control for several significant confounders, yet two more may remain. First, schools with multiple evaluators might assign who observes which teachers in confounding ways (e.g. evaluators might motivate improvement among low-performing teachers who receive better implementation independent of TREPs, introducing positive bias). The study tests the sensitivity of δ to specific evaluator and teacher pairings using equation (2):

y_ik = δ TREP_ik + η_ji + e_ik (2)

where η_ji is a vector of dummy variables for each pairing between teacher i and evaluator j, the evaluator who implemented the observation on which questionnaire submission k is based. Equation (2) effectively compares observation scores and TREPs within specific teacher-evaluator pairings during the four-month study.
Finally, the sensitivity of estimates to month fixed effects is tested using equation (3):

y_ik = δ TREP_ik + η_ji + τ_m + e_ik (3)

where τ_m is a fixed effect for month m, the month when submission k was submitted.
Month fixed effects control for significant within-semester omitted variables affecting TREPs and teacher performance month to month. As not all teachers necessarily submitted their first (or second) questionnaire in the same month, differences across months might bias δ if they correlate with both TREPs and teacher performance. For example, emerging work finds that teacher performance as measured by TEAM observation scores improves within a year, and this improvement may occur independent of evaluation practices (Hunter and Steinberg, 2022). However, if school leaders also improve their evaluation practices within a year, equations (1) and (2) would attribute improvements in teacher observation scores to higher TREPs, positively biasing δ. Month fixed effects control for such confounders as they remove month-to-month "shocks" to TREPs and teacher performance. Equations (2) and (3) cluster standard errors at the teacher-by-evaluator level.
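Equations (2) and (3) swap in pairing dummies η_ji and add month effects τ_m. When each pairing happens to contribute one observation per month (a balanced panel, which the actual data need not be), both sets of effects can be absorbed by two-way demeaning instead of explicit dummies. A simulated sketch under that assumption, with an invented true δ = 0.05 and a March "shock" of 0.1:

```python
# Two-way (pairing and month) fixed effects via double demeaning on a
# balanced simulated panel; clustered standard errors are omitted.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n_pairs = 80
df = pd.DataFrame({
    "pair": np.repeat(np.arange(n_pairs), 2),   # teacher-by-evaluator id
    "month": np.tile([2, 3], n_pairs),          # Feb and Mar submissions
    "trep": rng.integers(1, 7, size=2 * n_pairs),
})
eta = np.repeat(rng.normal(3.8, 0.4, n_pairs), 2)   # pairing effects
tau = np.where(df["month"] == 3, 0.1, 0.0)          # month shock
df["score"] = eta + tau + 0.05 * df["trep"] + rng.normal(0, 0.2, 2 * n_pairs)

# Balanced panel: subtracting pair means, month means, and adding back
# the grand mean exactly absorbs both eta and tau.
cols = ["trep", "score"]
dm = (df[cols]
      - df.groupby("pair")[cols].transform("mean")
      - df.groupby("month")[cols].transform("mean")
      + df[cols].mean())
delta = (dm["trep"] * dm["score"]).sum() / (dm["trep"] ** 2).sum()
```

With unbalanced data, the dummy-variable formulation of equations (2) and (3) is the general-purpose route; the demeaned version is shown here only to make the "within pairing, net of month shocks" comparison concrete.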

Descriptive statistics
Responses show that teachers agree that their evaluators implement the examined evaluation practices, though there is response variation (Tables 1 and 2). Table 1 and the bottom panel of Table 2 display descriptive statistics for items using the Strongly Disagree (= 1) to Strongly Agree (= 6) response scale. For these items, Agree = 4 is the modal and median response, and the interquartile range (IQR, 25th to 75th percentile) is 1. The top panels of Table 2 list statistics for TREPs using different scales. Seventy-seven percent of conferences included reviews of student work (StdtWork, Table 2). The median and modal response concerning feedback specificity (Specificity) was 3, the middle response option, with an IQR of 2. Finally, the modal response about post-observation conference length was 11-20 min (PostLength). Before using PostLength responses in equations (1) to (3), responses were converted to the median minutes within each bin. For example, the response of 11-20 min was converted to 10.5 min. After this conversion, the mean post-observation conference lasted 11.78 min with an SD of 4.69 min. The number of responses to each TREP item varied, ranging from 304 to 336.
A questionnaire submission must be linked to a teacher's next within-semester observation score for use in equations (1) to (3); otherwise, either the questionnaire submission or the outcome would be missing and unusable. These linked records come from 37 schools and 177 teachers who were evaluated by a total of 80 evaluators, yielding 258 unique teacher-by-evaluator pairings (Table 3). Among the 177 teachers, 96 submitted one questionnaire and 81 submitted two (Table 3); thus, TREP responses and observation scores from these 81 teachers drive the estimation of δ. The average respondent had a prior-year LOE of 360, placing her "At Expectations" per TDOE, had 10 years of experience, did not hold a Master's, and was a White female (Table 3). The average post-questionnaire-submission observation score is 3.77, with an SD of 0.61.

Main findings
Several examined TREPs are positively associated with teacher performance (Table 4, column I). A unit increase on the agreement scale concerning the extent to which teachers report that their observation scores reflect their performance (ScoresRefl) is associated with an observation score improvement of 0.07 (0.11 SD). Similarly, a unit increase in the extent to which the evaluator encourages teacher self-reflection during the post-observation conference (Reflection) is associated with an observation score improvement of 0.04 (0.07 SD). Teachers who reported spending an additional minute in their post-observation conference (PostLength) also experienced higher observation scores by 0.08 units (0.13 SD). As the typical post-observation conference lasted about 12 min, a one-minute extension lengthens the conference by approximately 8%. Three of four post-conference TREPs have positive, statistically and practically significant associations with observation scores (Table 4, column I). Unit increases in the extent to which teachers agree that they can access relevant in-school expertise (AccExp), have time to engage with that expertise (TimeExp), and have access to relevant professional development (AccPD) are each associated with an observation score improvement of about 0.03 (0.05 SD). The extent to which teachers believed their evaluators were in their class frequently enough to warrant accurate scores (Frequency) was the only TREP with a statistically significant negative association (Table 4, column I). None of the positive associations in column I were sensitive to the application of teacher-by-evaluator fixed effects (column II) or month fixed effects (column III). Indeed, none of the statistically significant positive coefficients in the original model changed after controlling for differences between specific teacher-by-evaluator pairings (column II) or monthly shocks (column III). However, the significant negative relationship with Frequency in column I becomes nonsignificant after accounting for differences between teacher-by-evaluator pairs and month fixed effects. Additionally, a previously nonsignificant association gains statistical significance after controlling for teacher-by-evaluator fixed effects: a unit increase in the extent to which teachers report that their post-observation conference feedback is specific (Specificity) is associated with a decline in observation scores of 0.19 units (0.31 SD), the largest association detected (columns II, III).5
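The standardized effects reported in parentheses are each coefficient divided by the outcome SD of 0.61 from the descriptive statistics. Reproducing that arithmetic (grouping the three 0.03 coefficients into one entry is my shorthand):

```python
# Convert reported coefficients (observation-score units) to SD units
# by dividing by the outcome SD of 0.61.
outcome_sd = 0.61
coefs = {
    "ScoresRefl": 0.07, "Reflection": 0.04, "PostLength": 0.08,
    "AccExp/TimeExp/AccPD": 0.03, "Specificity": -0.19,
}
standardized = {k: round(v / outcome_sd, 2) for k, v in coefs.items()}
```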

Conclusions
Notwithstanding the proliferation of teacher evaluation reforms around the globe intended to improve student outcomes and the consequent changes to school leadership (Hunter, 2021; Hunter and Rodriguez, 2021; Lavigne, 2018), we know little about which specific evaluation practices improve teaching and student learning. Stated differently, the field has yet to identify high-leverage teacher evaluation practices. The current study addressed this gap by leveraging a unique research design, linking 14 teacher-reported school leadership evaluation practices (TREPs) to teacher performance. Specifically, the analysis overcame several threats to causal inference plausibly affecting prior work by comparing teacher observation scores and TREPs within the same teacher over four months and tested the sensitivity of estimates to specific teacher-by-evaluator pairings and month-to-month shocks.
Of the 14 TREPs examined, six may represent high-leverage practices for teaching improvement, and half of these concern how evaluators (most of whom are school leaders) link evaluations to teacher professional learning opportunities. The data suggest that the practices of (a) scoring teachers in ways that reflect their performance (ScoresReflect), (b) encouraging teacher self-reflection during post-observation conferences (Reflection), providing teachers access to (c) professional development and (d) in-school expertise relevant to their improvement area (AccPD, AccExp), and (e) time to engage with those experts (TimeExp) represent high-leverage practices. Additionally, the evidence implies that (f) holding longer post-observation conferences (PostLength) may improve teaching. That accurate scoring (a) and longer conferences (f) represent high-leverage evaluation practices is consistent with prior work (Jawahar, 2010; Kraft and Gilmour, 2016); however, empirical research linking practices (b)-(e) to teacher and student outcomes is less common (Donaldson, 2021). Indeed, prior work regarding practices (b)-(e) has primarily been theoretical or limited to associations with intermediate outcomes, such as teacher-reported engagement in professional learning activities intended to improve teaching and student outcomes (Donaldson, 2021). Papay and colleagues' (2020) experimental study in Tennessee is an exception and represents some of the best evidence available concerning links between evaluation and teacher professional learning.
School leaders in Papay and colleagues' (2020) study paired a teacher with a low observation score in a specific teaching domain with an in-school colleague who scored substantially higher in the same domain; school leaders also provided time for the teacher pairs to meet for developmental purposes. The effects were striking: observation-informed pairings caused teacher observation scores and student achievement to improve dramatically. These results are consistent with the high-leverage evaluation practices identified by the current study. Theoretically, such teacher pairings depend on (a) accurately rated teacher performance. In addition, during (f) sufficiently long post-observation conferences, evaluator (b) encouragement of teacher self-reflection primes teachers for productive professional learning opportunities. However, the success of the teacher-pairing intervention also depends on the lower-performing teacher having access to (c) professional development and (d) in-school expertise (i.e. the teacher pairing), as well as (e) time to engage with the teacher partner. While evaluation practices (a)-(f) might explain the success of these teacher pairings, they are not limited to Papay's intervention. Indeed, practices (a)-(f) could manifest in the context of evaluation for development that leads to instructional coaching, broader forms of peer mentoring, or other professional learning opportunities.
The evidence also suggests that the other eight TREPs examined are not associated with teaching improvement, and that one, feedback specificity, may lower teacher performance. Most of these TREPs concerned feedback practices, even though an immense corpus of interdisciplinary work argues that effective feedback is critical to evaluation for development (Donaldson, 2021; Hattie and Timperley, 2016; Kluger and DeNisi, 1996; Kraft and Christian, 2022). Despite implications from earlier work, the lack of associations with feedback characteristics is consistent with a recently published educational study linking researcher-coded feedback characteristics to teacher performance (Hunter and Springer, 2022). Indeed, Hunter and Springer (2022) also found negative associations between feedback characteristics and Tennessee teacher observation scores, which the authors argued might reflect negative selection bias: evaluators might issue the most specific feedback to the least effective teachers, negatively biasing estimates. That explanation is less plausible, though not impossible, for the findings herein, because the current study controls for unobserved differences between teachers via fixed effects and, presumably, teacher effectiveness does not change appreciably within four months. If this study's estimates are not negatively biased, then the negative association between teacher performance and feedback specificity might be due to teacher reactions to specific feedback when it is critical. Highly specific critical feedback might decrease teacher efficacy, introduce or affirm negative self-images, or lead teachers to believe that their performance was so poor that they are incapable of reaching improvement goals, dampening their motivation to improve (Carver and Scheier, 1982; Donaldson, 2021). However, this explanation is speculative and deserves further academic attention.

Limitations
This study may suffer from four broad limitations. First, the study may not capture the causal effects of TREPs on teacher observation scores, despite its analytical strengths. Although the coefficients may capture the least-biased estimated relationships between TREPs and teacher performance, this does not mean they are causal. Indeed, the substantial negative relationship with feedback specificity may imply the presence of confounders. While it may be nigh impossible to randomly assign teachers to evaluators with qualitatively different evaluation practices, research may still be able to recover genuinely causal estimates. For example, with a large enough evaluator pool, researchers could randomly assign evaluators to training on specific evaluation skills and then collect TREPs and subsequent teacher performance measures. The researchers could then use variation from the random assignment as an instrument to predict exogenous variation in TREPs, capturing causal effects.
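The instrumental-variables design proposed above can be sketched with simulated data. This is a hypothetical illustration, not an analysis from the study: the variables (trained, trep, score), the confounder, and all coefficients are invented solely to show how randomized evaluator training could serve as an instrument in a two-stage least squares estimate.

```python
# Hypothetical sketch of the proposed IV design: randomly assigned evaluator
# training (the instrument) shifts TREPs, and two-stage least squares (2SLS)
# recovers the causal effect of TREPs on observation scores despite an
# unobserved confounder. All names and magnitudes are illustrative.
import numpy as np

rng = np.random.default_rng(1)
n = 2000
trained = rng.integers(0, 2, n)        # randomized training assignment
confound = rng.normal(0, 1, n)         # unobserved teacher/evaluator quality

# The practice responds to training; the score responds to the practice,
# but both practice and score are also driven by the confounder.
trep = 0.8 * trained + 0.5 * confound + rng.normal(0, 1, n)
score = 0.1 * trep + 0.7 * confound + rng.normal(0, 0.5, n)

# Stage 1: predict the TREP from the instrument (training assignment).
X1 = np.column_stack([np.ones(n), trained])
trep_hat = X1 @ np.linalg.lstsq(X1, trep, rcond=None)[0]

# Stage 2: regress scores on the predicted (exogenous) TREP variation.
X2 = np.column_stack([np.ones(n), trep_hat])
beta_iv = np.linalg.lstsq(X2, score, rcond=None)[0][1]

# For comparison, naive OLS on the raw TREP is inflated by the confounder.
Xo = np.column_stack([np.ones(n), trep])
beta_ols = np.linalg.lstsq(Xo, score, rcond=None)[0][1]
print(round(beta_iv, 3), round(beta_ols, 3))
```

Because training assignment is random, it is uncorrelated with the confounder, so the second-stage coefficient converges on the true effect (0.1 here) while naive OLS overstates it; this is the logic by which the proposed design could deliver causal estimates of TREP effects.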
Second, even if the current study recovers causal effects, its generalizability may be limited in two ways. The coefficients were effectively based on data from 81 teachers who volunteered to participate in the study, and the estimated relationships may not hold in other settings if reported evaluation practices or teacher performance correlate with the desire to participate in research studies. Additionally, the decision to strengthen internal validity by limiting the study to four months means the data do not capture any longer-term effects of the evaluation practices examined. Future research could feasibly collect within-semester data from larger teacher samples, allowing analysts to test the findings' generalizability to other settings. New studies could also collect data over longer periods to explore the longer-term effects of specific evaluation practices on teacher performance.
Third, the number of TREPs investigated was limited, though limiting it was critical to maintaining a collaborative researcher-practitioner partnership. Future work might explore other potentially high-leverage TREPs. Relatedly, measuring TREPs on questionnaire-based agreement scales instead of researcher-coding school leadership evaluation practices may limit the study's inferences. Ideally, the study would estimate the causal effect of genuinely and qualitatively different school leadership evaluation practices, which may differ from TREPs. Researchers might use standards-based school leadership performance rubrics, akin to those on which teacher observation scores are based, to assess the quality of specific evaluation practices; analysts could then link those evaluation practice measures to teacher or student performance measures. Although Grissom et al. (2018) attempted such a study, their measure of leadership practices was too generic to differentiate between individual leadership practices, a key feature of any study aiming to identify specific high-leverage leadership practices.
Fourth and finally, theory holds that the effects of school leadership teacher evaluation practices on teacher performance should ultimately flow through to student outcomes (Hallinger, 2011). Future research might therefore explore the effects of evaluation practices on student academic and non-academic outcomes.

Implications
The findings have implications for research, policy, and practice. First, the evidence extends and refines Hallinger's leadership for learning model by suggesting which leadership practices for teacher evaluation are likely to improve student learning and which are not. Like other recent evidence (Hunter and Springer, 2022), the findings herein imply that leadership practices concerning feedback qualities are not high-leverage. Instead, the most consistent evidence supporting Hallinger's model concerned leadership practices linking teacher evaluation to professional learning opportunities.
Policymakers might act on these findings by ensuring that school leaders can access opportunities to develop the identified high-leverage evaluation practices. Furthermore, leadership practices linking teacher evaluation to professional learning opportunities might be inhibited if schools cannot access adequate professional learning resources purposefully tied to the aspects of teacher performance assessed by classroom observation rubrics. Thus, policymakers might take steps to ensure that such professional learning systems exist.
Finally, identifying high-leverage evaluation practices provides helpful information for practitioners and leadership preparation programs. Leadership preparation or district-based in-service programs might emphasize these high-leverage practices when training aspiring or in-service leaders. More broadly, the evidence implies that such programs might underscore the importance of leadership practices linking teacher evaluation to professional learning. Even in the absence of policymaker or district leader action on this study's findings, the evidence suggests that school leaders themselves might be able to improve teaching by implementing the six high-leverage teacher evaluation practices identified.