B. Simulation Results
This section begins with a discussion of the standards we used to assess the relative
performance of the different missing data methods that we tested, and then provides
a summary of the simulation results (complete results are provided in
Appendix D).
Assessing the Performance of the Different Methods
To judge the performance of the different missing data methods, we assessed the
extent to which each approach produced bias in the impact estimate that would be
considered "high" relative to the benchmark set by the What Works Clearinghouse
(WWC). In the WWC, RCTs with attrition rates that are likely to yield non-response
bias of 0.05 standard deviations or greater are treated as if they were quasi-experimental
studies and are required to provide additional evidence suggesting that impact estimates
are unbiased (U.S. Department of Education, 2008).
Because the RCTs in education that are currently underway may at some point be subject
to review by the WWC, we decided to accept the 0.05 standard deviation threshold
for bias in assessing the performance of different missing data methods. In each
of our simulations, methods that yielded bias in the impact estimate of greater
than 0.05 standard deviations were deemed to have produced "high bias," while methods
that yielded bias of less than 0.05 standard deviations were deemed to have produced
"low bias."
Additionally, some of the methods may also yield biased standard errors, which contribute to the hypothesis test of whether the impact estimate is statistically significant.72 Therefore, we decided it was also important to set standards for assessing the magnitude of the bias in the estimates of the standard errors. We classified the bias in a standard error estimate as large ("high bias") if it would generate as much bias in the t-statistic as is produced by a 0.05 standard deviation bias in the impact estimate itself. In this way, we rely entirely on the WWC's attrition standard to determine whether the bias in the impact estimate or standard error should be treated as large ("high bias") or small ("low bias"). For more details on how we calculated the bias thresholds for the standard errors, see Appendix E.
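The rule for impact estimates can be expressed as a short check. The following is a minimal sketch, not the code used for the simulations: the function name and the example estimate of 0.26 are hypothetical, while the 0.05 threshold, the true impact of 0.20, and the estimate of 0.193 come from this chapter. Because the text does not say how a bias of exactly 0.05 would be labeled, the sketch assumes it counts as low.

```python
WWC_BIAS_THRESHOLD = 0.05  # standard deviations, taken from the WWC attrition standard

def classify_impact_bias(estimated_impact, true_impact, threshold=WWC_BIAS_THRESHOLD):
    """Label the bias in an impact estimate (in effect-size units) as high or low."""
    bias = abs(estimated_impact - true_impact)
    # Assumption: a bias exactly equal to the threshold is treated as low.
    return "high bias" if bias > threshold else "low bias"

print(classify_impact_bias(0.193, 0.20))  # bias of 0.007 standard deviations
print(classify_impact_bias(0.26, 0.20))   # hypothetical estimate, bias of 0.06
```

The same comparison would apply to standard errors once the corresponding threshold from Appendix E is substituted for 0.05.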
Simulation Results
Exhibit 3 summarizes the results from the simulations in which data were
missing from 40 percent of students within each school; Exhibit 4 summarizes the
results from the simulations in which data were missing from 40 percent of schools.
Each table presents the two key performance measures: (1) bias in the impact estimate,
and (2) bias in the estimated standard error. The tables include three columns,
one for each of the three scenarios—Scenario I, in which the data were missing
at random within group (treatment or control); Scenario II, in which the data were
missing at random after conditioning on group and pre-intervention characteristics
of the students (demographics and pretest scores); and Scenario III, in which the
missing data depended on the outcome measure (student post-test scores), even
after conditioning on group and pre-intervention characteristics of the students.
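To make the distinction among the three scenarios concrete, the sketch below shows one way a missingness probability could be defined for each. It is illustrative only: the logistic form, the coefficient values, and the function name are assumptions and are not the data-generating models used in our simulations.

```python
import math

def missing_probability(scenario, treated, pretest, posttest):
    """Hypothetical probability that a student's score is missing.

    scenario: "I", "II", or "III"; treated: 0 or 1;
    pretest and posttest: standardized scores.
    """
    logit = math.log(0.40 / 0.60)   # 40 percent missing for a control student with average scores
    logit += 0.2 * treated          # Scenario I: depends on group only (coefficient is made up)
    if scenario in ("II", "III"):
        logit += -0.5 * pretest     # Scenario II: also depends on an observed pre-intervention measure
    if scenario == "III":
        logit += -0.5 * posttest    # Scenario III: also depends on the outcome itself
    return 1.0 / (1.0 + math.exp(-logit))

# Under Scenario II the post-test has no effect once group and pretest are fixed.
print(missing_probability("II", 1, 0.5, 2.0) == missing_probability("II", 1, 0.5, -2.0))
```

In this sketch, only Scenario III lets the probability change with the post-test score after group and pretest are held fixed, which is what distinguishes it from the other two scenarios.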
As discussed above, we also conducted simulations in which data were missing for five percent of students and schools, but none of the methods produced bias that exceeded the thresholds that we selected for these simulations under any of the three scenarios and for either missing pretests or post-tests. Therefore, we do not provide summary tables for the results from these simulations (the results themselves are provided in Appendix D).
We used the simulation results to assess the performance of different missing data methods when applied to the specific context that our simulations were designed to inform—Group Randomized Trials in which schools are randomized to treatment or control. Below, we present the results for the different missing data methods in those simulations for which pretest scores were collected and included in the impact analysis model.73
Case Deletion. Although case deletion is often criticized, the technical literature provides a more nuanced view of the method. For example, Allison (2002) indicates that case deletion will work well in some situations and poorly in others. More specifically, he indicates that case deletion will yield biased impact estimates when missing data for an independent variable depends on the observed value of the dependent variable (see Allison 2002, p. 6).74 In our simulations, this scenario corresponds to missing pretest scores where the missingness depends on the post-test (Scenario III). However, he also indicates that case deletion yields less bias in the coefficient estimates than other methods when missing data on an independent variable depends on its unobserved value.75 In our simulations, this corresponds to missing pretest scores where the missingness depends on the pretest (Scenario II).
The results from our simulations are consistent with Allison's assessment. When pretest data were missing for 40 percent of students, and missingness depended on the value of the post-test (Scenario III), case deletion yielded impact estimates with bias that exceeded 0.05 standard deviations. In contrast, most other methods yielded impact estimates with bias of less than 0.05 standard deviations (see Exhibits 3 and 4). In addition, under Scenario III, when pretest scores were missing for either 40 percent of students or 40 percent of schools, case deletion produced impact estimates with greater bias than all of the other methods we tested, except for mean value imputation (see Appendix D, Tables III.b.1 and III.b.2).
However, also consistent with Allison's assessment, when missing pretest scores depended on the value of the pretest itself (Scenario II), case deletion yielded impact estimates with bias of less than 0.05 standard deviations (see Exhibits 3 and 4). In this scenario, case deletion of missing students or schools produced impact estimates that were closer to the true impact of 0.20 than all of the other methods we tested (see Appendix D, Tables II.b.1 and II.b.2). Therefore, in summary, the simulation results for case deletion closely matched the results that the literature would lead us to expect for missing pretest scores, i.e., case deletion produced impact estimates with less bias than other methods under some conditions and more bias than other methods under other conditions.
For missing post-test scores, however, case deletion worked as well as, or better than, all of the alternative methods across all of the missing data scenarios. In most of the missing post-test scenarios, this method produced impact estimates with bias below the thresholds set for the simulations, and in all scenarios the biases in standard errors were less than the WWC-based thresholds. In the simulations where this method produced impact estimates with bias that exceeded 0.05 standard deviations (Scenario III, missing post-test scores for 40 percent of students), none of the other methods produced impact estimates with bias of less than 0.05 standard deviations.
Finally, with respect to bias in the impact estimates from missing post-test scores, case deletion performed similarly to other methods in the following sense: it produced bias of greater than 0.05 standard deviations only under Scenario III when data were missing for 40 percent of students (see Exhibit 3). In all other simulations, it produced bias of less than 0.05 standard deviations (see Exhibits 3 and 4). In addition, for missing post-test scores, the differences in bias between case deletion and the other tested methods were, in all but one case, less than 0.01 standard deviations (see Appendix D). For example, under Scenario II, when data were missing for 40 percent of students, multiple stochastic regression imputation yielded an impact estimate exactly equal to the true impact of 0.20, while case deletion yielded an impact estimate of 0.193, a difference of 0.007 standard deviations.
In summary, case deletion produced bias in the impact estimates that exceeded the WWC-based threshold in two of our simulations:
Dummy Variable Method. The dummy variable method has been criticized in the literature for producing biased coefficient estimates (Allison, 2002 and Jones, 1996). While Jones (1996) is commonly cited as evidence that this method yields biased estimates, the appendix to this journal article provides a proof that the coefficient estimates will be unbiased if the two independent variables in his example, the one with missing data and the one without missing data, are uncorrelated with each other. In RCTs the variable of interest is the treatment indicator, which is never missing. Furthermore, when data are complete, randomization ensures that the treatment indicator is uncorrelated with the other independent variables. This raises the question of whether the standard critique of the dummy variable method applies in the particular context of education RCTs—in particular, in Group Randomized Trials where schools are randomly assigned to treatment or control, but the pretest score (or some other covariate) is missing for some students or schools.
The evidence from the simulation results indicates that for missing pretest scores, the dummy variable method performed similarly to the more sophisticated methods. In particular, we found that the dummy variable method produced impact estimates with bias of less than 0.05 standard deviations under all three scenarios (see Exhibits 3 and 4). In addition, in none of our simulations did the dummy variable method produce standard errors with bias that exceeded the threshold established for these simulations (see Appendix D). Therefore, our simulation results cast doubt on whether the general concerns about this method, which we do not dispute, should deter analysts from adopting it in studies that randomly assign schools to educational interventions.
In summary, the dummy variable method produced impact estimates and standard error estimates with bias that fell within the acceptable range, as defined by the WWC-based criteria that we selected, in all of our simulations.
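For readers unfamiliar with the mechanics, the dummy variable method can be sketched as follows. This is an illustration rather than the simulation code: it assumes student-level random assignment with no school clustering, made-up coefficients apart from the true impact of 0.20, pretests missing completely at random, and a fill value of zero for missing pretests. All function and variable names are hypothetical.

```python
import numpy as np

def dummy_variable_design(treatment, pretest):
    """Columns: intercept, treatment, pretest with missing values filled, missing-pretest flag."""
    treatment = np.asarray(treatment, dtype=float)
    pretest = np.asarray(pretest, dtype=float)
    missing = np.isnan(pretest)
    filled = np.where(missing, 0.0, pretest)  # any constant works; the flag absorbs the fill value
    return np.column_stack(
        [np.ones_like(treatment), treatment, filled, missing.astype(float)]
    )

rng = np.random.default_rng(0)
n = 20000
treatment = rng.integers(0, 2, n)                       # random assignment at the student level
pretest = rng.normal(size=n)
posttest = 0.20 * treatment + 0.7 * pretest + rng.normal(scale=0.7, size=n)
pretest_observed = np.where(rng.random(n) < 0.40, np.nan, pretest)  # 40 percent missing

X = dummy_variable_design(treatment, pretest_observed)
coef, *_ = np.linalg.lstsq(X, posttest, rcond=None)     # ordinary least squares fit
impact_estimate = coef[1]                               # coefficient on the treatment indicator
print(round(float(impact_estimate), 3))
```

Because the treatment indicator is never missing and is assigned at random, every student stays in the analysis sample, which is the feature of RCTs that motivates revisiting the standard critique.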
Mean Value Imputation. In general, mean value imputation is known to produce biased estimates of the standard errors of coefficients in regression models (see Allison, 2002 and Haitovsky, 1968). While there is no particular reason to believe this conclusion would not apply to RCTs in which schools are randomly assigned to treatment or control, our simulations shed light on whether this method, when applied to missing pretest scores or missing post-test scores, yields standard error estimates (1) with more or less bias than other methods, and (2) with bias that exceeds the threshold we developed for these simulations.
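One reason the general concern arises is that filling every missing value with the same mean shrinks the variability of the completed data. The sketch below illustrates this with hypothetical scores; the function name is an assumption, and it imputes the overall observed mean, which may differ from the exact imputation rule used in the simulations.

```python
from statistics import mean, pvariance

def mean_impute(values):
    """Replace missing entries (None) with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

scores = [-1.2, None, 0.4, None, 1.1, -0.3]  # hypothetical standardized scores
completed = mean_impute(scores)

observed_variance = pvariance([v for v in scores if v is not None])
completed_variance = pvariance(completed)
# The imputed cases sit exactly at the mean, so the completed data vary less.
print(completed_variance < observed_variance)
```

A standard error computed from the completed data treats the imputed values as if they were real observations, which is why the resulting estimate can be too small.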
When data were missing for pretest scores, mean value imputation did not produce standard error estimates with bias that exceeded the WWC-based thresholds chosen for these simulations. When data were missing for 40 percent of students, however, mean value imputation produced impact estimates with bias that exceeded 0.05 for Scenarios II and III (see Exhibits 3 and 4).
When data were missing for post-test scores, mean value imputation produced standard error estimates with bias that exceeded the WWC-based thresholds in many of our simulations (see Exhibits 3 and 4). In fact, mean value imputation was the only method to yield standard errors with bias that exceeded the chosen thresholds in all three scenarios when data were missing for 40 percent of students and when data were missing for 40 percent of schools (see Exhibits 3 and 4). Finally, it is worth noting that when data were missing for 40 percent of students, mean value imputation was the only method to yield bias of greater than 0.05 standard deviations under both Scenarios II and III.
In summary, mean value imputation produced bias in the impact estimates and standard errors that exceeded the WWC-based thresholds in several of our simulations:
Single Non-Stochastic Regression Imputation. In general, single non-stochastic regression imputation is well known to yield standard error estimates that are biased downward (see Chapter 3). Our simulation results can be used to address the question of whether this method, when used to impute missing pretest or post-test scores in a Group Randomized Trial, yields standard error estimates (1) with more or less bias than other methods, and (2) with bias that exceeds the threshold chosen for these simulations.
When either pretest or post-test scores were missing for 40 percent of schools (the unit of random assignment), single non-stochastic regression imputation produced standard error estimates with bias that exceeded the WWC-based threshold (see Exhibit 4). In fact, when pretest scores were missing for 40 percent of schools, the estimated bias was greater for this method than for any of the other methods.
However, when either pretest or post-test scores were missing for 40 percent of students within each school, single non-stochastic regression imputation produced standard error estimates with bias that fell below the WWC-based threshold (see Exhibit 3). In fact, the estimated bias was less than or equal to 0.001 standard deviations, or one percent of the true standard error, in all three simulations. This suggests that when schools are randomly assigned but data are missing at the student level, the general concerns about single non-stochastic regression imputation may not apply.
In summary, single non-stochastic regression imputation produced bias in the impact estimates and standard errors that exceeded the WWC-based thresholds in several of our simulations:
Single Stochastic Regression Imputation. Single stochastic regression imputation is considered to be a "partial fix" to the problem associated with single non-stochastic regression imputation (see Chapter 3). Therefore, we would expect the bias in the standard error to be lower with single stochastic regression imputation than with single non-stochastic regression imputation.
Our simulation results indicate that relative to the WWC-based threshold for bias in the standard error, single stochastic regression imputation performed equally to single non-stochastic regression imputation. By this, we specifically mean that in each simulation (e.g., for both missing pretests and missing post-tests in all three scenarios), both methods produced standard errors that either exceeded the bias threshold or fell below the threshold (see Exhibits 3 and 4).
However, these results should not be interpreted as evidence against the conclusion from the literature that single stochastic regression imputation produces standard errors with less bias than single non-stochastic regression imputation. For each of the simulations with missing data for 40 percent of schools, the estimated bias was smaller for single stochastic regression imputation than for single non-stochastic regression imputation (see Appendix D, Tables I.b.2, II.b.2, and III.b.2).
Finally, with respect to bias in the impact estimates themselves, and relative to the WWC-based threshold, Exhibits 3 and 4 show that single stochastic regression imputation performed equivalently to most other methods, including single non-stochastic regression imputation. In all but one of the simulations, the bias in the impact estimate was either greater than 0.05 standard deviations for both of these two methods or less than 0.05 standard deviations for both methods.76 This is not surprising since the addition of a stochastic error term to the imputed values is not intended to reduce bias in the impact estimate; rather, it is intended to reduce bias in the estimated standard error of the impact estimate.
In summary, single stochastic regression imputation produced bias in the impact estimates and standard errors that exceeded the WWC-based thresholds in some of our simulations:
Multiple Stochastic Regression Imputation. Multiple stochastic regression imputation is considered to be a technically appropriate solution to the problem associated with single stochastic regression imputation (see Chapter 3). Therefore, we would expect the bias in the standard error to be either low or zero—and lower than the bias from single stochastic regression imputation.
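Multiple imputation analyses typically combine the estimates from the imputed data sets using Rubin's combining rules, which add the between-imputation variance to the average within-imputation variance. The sketch below illustrates that calculation. Whether the simulations pooled results in exactly this way is an assumption, and the function name and input values are hypothetical.

```python
import math
from statistics import mean, variance

def combine_imputations(estimates, standard_errors):
    """Pool impact estimates and standard errors across m imputed data sets."""
    m = len(estimates)
    pooled_estimate = mean(estimates)
    within = mean([se ** 2 for se in standard_errors])  # average sampling variance
    between = variance(estimates)                       # variance across imputations (m - 1 denominator)
    total = within + (1.0 + 1.0 / m) * between
    return pooled_estimate, math.sqrt(total)

# Three hypothetical imputed data sets, each with a standard error of 0.05
pooled_estimate, pooled_se = combine_imputations([0.18, 0.20, 0.22], [0.05, 0.05, 0.05])
print(round(pooled_estimate, 3), round(pooled_se, 4))
```

The pooled standard error is larger than any single-imputation standard error whenever the estimates differ across imputations, which is how the approach reflects the uncertainty in the imputed values.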
The simulation results were consistent with this expectation. In all of our simulations, multiple stochastic regression imputation produced standard errors with bias estimates that fell below the WWC-based threshold selected for these simulations (see Exhibits 3 and 4), including the scenarios where both of the single regression imputation methods produced bias that exceeded the WWC-based threshold (see Exhibit 4).
With respect to bias in the impact estimates themselves, and relative to the WWC-based threshold, Exhibits 3 and 4 show that multiple stochastic regression imputation performed equivalently to most other methods, including single stochastic regression imputation. In all but one of the scenarios, the bias in the impact estimate was either greater than 0.05 standard deviations for both of these two methods or less than 0.05 standard deviations for both methods.77 This is not surprising since multiple imputation is not designed to produce less biased impact estimates than single stochastic regression imputation: it is intended to reduce bias in the estimated standard error of the impact estimate.
In summary, multiple stochastic regression imputation produced bias in the impact estimates that exceeded the WWC-based thresholds in only one of our simulations: when post-test scores were missing for 40 percent of students under Scenario III. Under this scenario, none of the methods produced impact estimates with bias of less than 0.05 standard deviations.
EM Algorithm with Multiple Imputation. As discussed in Chapter 3, the EM algorithm is a maximum likelihood approach that can be used to directly obtain coefficient estimates or to impute missing values. When combined with multiple imputation, the literature suggests this approach should yield standard errors with little or no bias (see Chapter 3).
The simulation results are consistent with this expectation. Our simulation results indicate that relative to the WWC-based threshold for bias in the standard error, the EM algorithm with multiple imputation performed equally to multiple stochastic regression imputation. When data were missing for 40 percent of students or schools, the EM algorithm with multiple imputation produced standard errors with estimated bias that fell below the WWC-based threshold selected for these simulations in all three scenarios and for both missing pretests and missing post-tests (see Exhibits 3 and 4).
In addition, with respect to bias in the impact estimates themselves, relative to the WWC-based threshold, Exhibits 3 and 4 show that the EM algorithm with multiple imputation performed equivalently to most other methods, including multiple stochastic regression imputation. Like multiple stochastic regression imputation, when data were missing for 40 percent of students, the EM algorithm with multiple imputation produced impact estimates with bias of less than 0.05 when the missing data mechanism could be characterized as MCAR or MAR (e.g., Scenarios I and II for missing post-test scores), and it produced impact estimates with bias of greater than 0.05 when the missing data mechanism could be characterized as NMAR (e.g., Scenario III for missing post-test scores). When data were missing for 40 percent of schools, the EM algorithm with multiple imputation produced impact estimates with bias of less than 0.05 standard deviations for both missing pretests and post-tests under all three scenarios.
In summary, the EM algorithm with multiple imputation produced bias in the impact estimates that exceeded the WWC-based thresholds in only one of our simulations: when post-test scores were missing for 40 percent of students under Scenario III.
Weighting, Simple Approach. The simple weighting approach, which can be applied in evaluations in which data are missing for selected students in each school, involves weighting up the students with non-missing data to the count of all students in the school. If the impact of the intervention varies across schools, this method might be expected to produce impact estimates with less bias than case deletion. If the impact of the intervention does not vary across schools, then we might expect this method to produce impact estimates with bias that is equivalent to that of case deletion. In our simulation scenarios, the true impact was constant across schools. Therefore, we would not expect this method to produce impacts that are much different from the impacts produced by case deletion.
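The weight construction described above can be sketched in a few lines. This is a minimal illustration with hypothetical school identifiers and a hypothetical function name, not the code used in the simulations.

```python
def simple_weights(school_ids, has_data):
    """Weight each student with data by (students in school) / (students with data in school)."""
    totals, responders = {}, {}
    for school, observed in zip(school_ids, has_data):
        totals[school] = totals.get(school, 0) + 1
        if observed:
            responders[school] = responders.get(school, 0) + 1
    # Students without data get a weight of zero and drop out of the analysis.
    return [
        totals[school] / responders[school] if observed else 0.0
        for school, observed in zip(school_ids, has_data)
    ]

school_ids = ["A", "A", "A", "A", "B", "B"]
has_data = [True, True, False, False, True, True]
weights = simple_weights(school_ids, has_data)
print(weights)  # school A responders count double; school B responders count once
```

Within each school the weights sum to the number of enrolled students, so schools with more missing data are not underrepresented in the weighted analysis.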
The simulation results are consistent with these expectations. Relative to the bias standards that we adopted for both impacts and standard errors, the performance of the simple weighting approach was equivalent to the performance of case deletion for all simulations (see Exhibits 3 and 4). In addition, the difference in impacts between the simple weighting approach and case deletion was less than or equal to 0.003 standard deviations in all of the simulations (see Appendix D, Tables I.b.1, II.b.1, and III.b.1).
In summary, the simple weighting approach produced bias in the impact estimates that exceeded the WWC-based thresholds in only one of our simulations: when post-test scores were missing for 40 percent of students under Scenario III.
Weighting, More Sophisticated Approach. As described earlier, this approach involves estimating a propensity model and using this model to assign weights to cases with non-missing data. In the literature, this method is considered an acceptable alternative to multiple imputation.
Relative to the bias standards that we adopted for both impacts and standard errors, the performance of the more sophisticated weighting approach was equivalent to the performance of both multiple stochastic regression imputation and the EM algorithm with multiple imputation for all simulations (see Exhibits 3 and 4). In addition, the difference in impacts between the more sophisticated weighting approach and multiple stochastic regression imputation was less than or equal to 0.001 standard deviations in simulations with missing data for 40 percent of students (see Appendix D, Tables I.b.1, II.b.1, and III.b.1) and less than or equal to 0.01 standard deviations in simulations with missing data for 40 percent of schools (see Appendix D, Tables I.b.2, II.b.2, and III.b.2).
In summary, the more sophisticated weighting approach produced bias in the impact estimates that exceeded the WWC-based thresholds in only one of our simulations: when post-test scores were missing for 40 percent of students under Scenario III.
Fully Interacted Regression Models with Treatment-Covariate Interactions. As described earlier in this chapter, we tested this approach by adding the interaction between the treatment indicator and the pretest variable as an independent variable in the model used to estimate the impacts of the intervention. This method ensures that the average treatment effect is evaluated at the mean for the entire sample—not just the mean for the sample with complete post-test data. Because of this, we would expect this method to produce impact estimates with less bias than case deletion. However, we had no prior expectations regarding the expected performance of this method relative to the other methods.
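The idea of evaluating the treatment effect at the full-sample mean can be sketched as follows. The sketch assumes that the pretest is centered at its mean over all students, including those without post-test data, before the interaction is formed; this centering step, the noise-free toy data, and all names are illustrative assumptions rather than a description of the simulation code.

```python
import numpy as np

def interacted_design(treatment, pretest):
    """Columns: intercept, treatment, centered pretest, treatment x centered pretest."""
    treatment = np.asarray(treatment, dtype=float)
    pretest = np.asarray(pretest, dtype=float)
    centered = pretest - pretest.mean()  # centered at the full-sample mean
    return np.column_stack(
        [np.ones_like(treatment), treatment, centered, treatment * centered]
    )

# Hypothetical noise-free data with a true impact of 0.20
treatment = np.array([0, 1, 0, 1, 0, 1])
pretest = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
posttest = 0.20 * treatment + 0.5 * pretest
has_posttest = np.array([True, True, True, True, False, False])

X = interacted_design(treatment, pretest)  # built from all six students
coef, *_ = np.linalg.lstsq(X[has_posttest], posttest[has_posttest], rcond=None)
impact_estimate = coef[1]  # treatment effect evaluated at the full-sample pretest mean
print(round(float(impact_estimate), 3))
```

Only students with post-test data enter the fit, but because the pretest is centered using everyone, the coefficient on the treatment indicator refers to a student at the mean of the entire sample.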
Relative to the bias standards that we adopted for both impacts and standard errors, the performance of this method was equivalent to the performance of the methods we have recommended thus far (see Exhibits 3 and 4). In addition, the difference in impacts between fully interacted regression models with treatment-covariate interactions and multiple stochastic regression imputation was less than or equal to 0.002 standard deviations in simulations with missing data for 40 percent of students (see Appendix D, Tables I.b.1, II.b.1, and III.b.1) and less than or equal to 0.009 standard deviations in simulations with missing data for 40 percent of schools (see Appendix D, Tables I.b.2, II.b.2, and III.b.2).
In summary, fully interacted regression models with treatment-covariate interactions produced bias in the impact estimates that exceeded the WWC-based thresholds in only one of our simulations: when post-test scores were missing for 40 percent of students under Scenario III.
Testing a Range of Missing Data Rates
As discussed above, our simulations tested the performance of selected missing data
methods at two levels of missing data for both schools and students, i.e., we ran the
simulations at five percent and 40 percent missing. This raised an
obvious question, "Is there a point along this range of possible attrition at which
the results change?" To explore the sensitivity of the results to intermediate missing
data rates, we ran simulations within the 5%-40% range for a subset of missing data
methodologies. In particular, for missing post-test data and Scenario III—the scenario
that analysts worry the most about because the data are NMAR—we tested the performance
of case deletion, non-stochastic regression imputation, and multiple stochastic
regression imputation with missing data rates of 10 percent, 20 percent, and 30
percent. Then we combined those results with the results for missing data rates
of 5 percent and 40 percent to map out the relationship between the missing data
rate and the performance of these three methods. We found that as the missing data
rate increased, the bias also increased; however, these changes were smooth and gradual,
revealing no obvious "tipping point."
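A stylized calculation shows why a smooth relationship is plausible. The sketch below is a toy model and not a reproduction of our simulations: it assumes normally distributed post-test scores with unit variance, equal-sized treatment and control groups, no covariates or school clustering, and an extreme NMAR mechanism in which every score below a common cutoff is missing. All names are hypothetical, and the resulting values should not be read as the biases reported in Appendix D.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()
TRUE_IMPACT = 0.20

def truncated_mean(group_mean, cutoff):
    """Mean of a unit-variance normal outcome among cases scoring above the cutoff."""
    a = cutoff - group_mean
    return group_mean + STD_NORMAL.pdf(a) / (1.0 - STD_NORMAL.cdf(a))

def cutoff_for_rate(missing_rate):
    """Bisect for the common cutoff that removes the requested overall share of cases."""
    lo, hi = -10.0, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        share = 0.5 * STD_NORMAL.cdf(mid) + 0.5 * STD_NORMAL.cdf(mid - TRUE_IMPACT)
        if share < missing_rate:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def case_deletion_bias(missing_rate):
    """Bias in the difference in observed means when low scorers are dropped."""
    cutoff = cutoff_for_rate(missing_rate)
    estimate = truncated_mean(TRUE_IMPACT, cutoff) - truncated_mean(0.0, cutoff)
    return estimate - TRUE_IMPACT

rates = (0.05, 0.10, 0.20, 0.30, 0.40)
biases = {rate: case_deletion_bias(rate) for rate in rates}
for rate in rates:
    print(f"{rate:.0%} missing: bias {biases[rate]:+.3f}")
```

In this toy model the bias grows steadily in magnitude as the cutoff rises, with no abrupt jump at any particular missing data rate.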