One way to better understand the missing data problem is to see how it is related to the very first type of bias to which most of us were introduced in our first regression class, omitted variable bias. Suppose the true model of impacts is shown in equation (1):
(1) Y = β0 + β1X + β2Trt + ε, where ε ~ N(0, σ1 • I)
where X is the baseline covariate and Trt is the treatment group indicator. However, suppose that the researchers conducting the RCT estimate a simpler model that excludes the baseline variable, as shown in equation (2):
(2) Y = α0 + α1Trt + v, where v ~ N(0, σ2 • I)
The decision to estimate equation (2) instead of (1) may be driven by lack of knowledge (the researchers may not realize that the baseline variable affects the outcome) or by necessity (the variable may be inherently unobservable). Alternatively, the decision may rest on the belief that “simpler is better” in RCTs, since in the absence of missing data RCTs yield unbiased impact estimates even when no control variables are included.
Now suppose some of the data are missing. More specifically, suppose that the outcome variable (Y) is missing for some cases, and the researchers plan to drop cases with missing values. (Other approaches to missing data are considered in the body of the report, but dropping cases with missing values may be the best tool for illustrating the consequences of missing data.) In this very common scenario, will the researchers obtain unbiased estimates of the treatment effect (β2)? The answer is “it depends” or “only in special cases.” If the observations with non-missing values are just a simple random sample of the larger sample (MCAR), the answer is yes; the only consequence is a smaller sample and less statistical power. If the observations with non-missing values are at least random conditional on the independent variables in the model (the MAR category), then the answer is still yes.
What does this mean for our simple example? It means the RCT can obtain unbiased impact estimates if (1) the data are missing completely at random (just a coin toss or roll of the dice) or (2) the data are missing at random within each group defined by the only covariate included in the model: the treatment indicator. Exactly how this can be achieved is a core portion of the remainder of this appendix. Scenario (2) warrants some additional consideration, since a difference in response rates between the treatment and control groups might be taken as a sign that the impact estimates are biased. However, as long as the process behind the missing data is completely random within each group, it does not matter if the percentage of cases with missing data differs between the two groups: the estimated treatment effect will still be unbiased.
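Scenario (2) can be checked with a small simulation. The sketch below is illustrative only: the coefficient values, the binary form of X, the response rates, and the function name are all assumptions chosen for the example, not values from any study. Outcomes are generated from equation (1), cases are dropped at different rates in the two groups but purely at random within each group, and the impact is estimated as in equation (2), that is, as a simple difference in means.

```python
import random


def simulate_random_within_group(n=200_000, beta1=-5.0, beta2=3.0,
                                 resp_treat=0.9, resp_control=0.6, seed=0):
    """Complete-case difference in means when missingness depends only on Trt.

    All parameter values are hypothetical and chosen for illustration.
    """
    rng = random.Random(seed)
    y_sum = [0.0, 0.0]   # sum of observed outcomes, indexed by Trt
    n_obs = [0, 0]       # number of observed outcomes, indexed by Trt
    for _ in range(n):
        trt = rng.randint(0, 1)                # random assignment
        x = 1 if rng.random() < 0.3 else 0     # baseline covariate (assumed binary)
        y = 50.0 + beta1 * x + beta2 * trt + rng.gauss(0.0, 10.0)
        # Response is a coin toss whose probability depends only on Trt.
        resp = resp_treat if trt == 1 else resp_control
        if rng.random() < resp:
            y_sum[trt] += y
            n_obs[trt] += 1
    # Equation (2) estimated on complete cases: difference in group means.
    return y_sum[1] / n_obs[1] - y_sum[0] / n_obs[0]


if __name__ == "__main__":
    print(simulate_random_within_group())
```

Under these assumptions the estimate should land close to the assumed true impact (beta2), even though the response rate is 90 percent in one group and 60 percent in the other.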
However, there are still two potential pitfalls that could lead to biased estimates of the average impact of the treatment (both fall under the NMAR case). First, even where the occurrence of “missingness” is unrelated to treatment status, it can be related to other variables that have been omitted from the model (just as X has been omitted from equation (2)), and this can cause bias. This is a case where missingness causes the observed treatment and control group outcome samples to be “equally unrepresentative” of the population of interest (i.e., the population these samples would represent if the outcome data were complete). For example, suppose the outcome variable in equation (2) is the student's score on the state assessment in reading, and the observed baseline variable excluded from the model (X in equation (1)) equals 1 for Limited English Proficiency (LEP) students and 0 for non-LEP students. If the missing data rate is larger for LEP students than for other students, and larger by the same amount in the treatment group and the control group, then the analysis sample of students (that is, the students with non-missing data) would be skewed toward non-LEP students in both the treatment group and the control group.
So when is this a problem? It is a problem when the impact of the treatment differs between LEP students and other students. For example, suppose the impact of the program on reading achievement is larger for LEP students than for other students. If LEP students are underrepresented in the analysis sample due to missing data, this will pull the estimated impacts downward. In this example, and many like it, random assignment will provide an unbiased estimate of the treatment's average impact for students with nonmissing data. However, because missing data have skewed both the treatment and control samples toward non-LEP students, for whom the impacts are relatively small, equation (2) will yield a downwardly biased estimate of the treatment's average impact for students in the broader study sample (and for whatever population this sample was designed to represent).
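The arithmetic behind this first pitfall can be made concrete. The sketch below uses made-up numbers (a 30 percent LEP share, impacts of 6 points for LEP students and 2 points for other students, and response rates of 40 and 90 percent that are the same in both experimental groups); none of these figures come from the report, and the function name is hypothetical.

```python
def complete_case_average_impact(share_lep, impact_lep, impact_non_lep,
                                 resp_lep, resp_non_lep):
    """Compare the average impact in the full sample with the average impact
    among students with nonmissing outcomes.

    Response rates are assumed identical in the treatment and control groups,
    so both groups are "equally unrepresentative".
    """
    # Share of LEP students among cases with nonmissing outcomes.
    obs_lep = share_lep * resp_lep
    obs_non_lep = (1.0 - share_lep) * resp_non_lep
    share_lep_observed = obs_lep / (obs_lep + obs_non_lep)

    # Average impacts are subgroup impacts weighted by subgroup shares.
    full_sample_impact = (share_lep * impact_lep
                          + (1.0 - share_lep) * impact_non_lep)
    analysis_sample_impact = (share_lep_observed * impact_lep
                              + (1.0 - share_lep_observed) * impact_non_lep)
    return share_lep_observed, full_sample_impact, analysis_sample_impact


if __name__ == "__main__":
    # Hypothetical inputs for illustration only.
    print(complete_case_average_impact(0.3, 6.0, 2.0, 0.4, 0.9))
```

With these hypothetical inputs, LEP students make up 16 percent of the analysis sample rather than 30 percent, so the average impact for students with nonmissing data (2.64) falls below the average impact for the full study sample (3.2).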
The second potential pitfall arises if missing data are related to both treatment status and a variable that has been omitted from the model. In this context, the analysis sample in both groups (treatment and control) will be unrepresentative of the broader population of students. However, because missing data are related to both treatment status and the omitted variable, the analysis samples in the treatment group and the control group will not be “equally unrepresentative,” i.e., the treatment and control samples will be “differentially skewed” toward non-LEP students. While the first pitfall yields unbiased impact estimates for the wrong population, this pitfall yields biased impact estimates for the wrong population. In both instances, the wrong population is being studied in relation to the information policy makers need about the full set of students potentially impacted by an intervention.
To gain a better understanding of the second pitfall, let us build on the example developed in this section. The treatment and control samples used in the analysis could be “differentially skewed” toward non-LEP students if the treatment itself has a positive effect on English proficiency, and LEP students with higher English proficiency are more likely to be required to take the state test used to create the outcome variable for the analysis. In this scenario, within the analysis sample, the treatment group would be less skewed toward non-LEP students than the control group.
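A simulation can make this scenario tangible. The sketch below is a simplified stand-in rather than a model of any real testing policy: it assumes a homogeneous true impact, a negative effect of LEP status on the outcome, and response rates for LEP students of 60 percent in the treatment group versus 30 percent in the control group (90 percent for non-LEP students in both groups). All of these values, and the function name, are hypothetical. The function returns the LEP share in each group's analysis sample along with the complete-case difference in means, so the latter can be compared with the true impact assumed in the simulation.

```python
import random


def simulate_differential_missingness(n=200_000, beta1=-5.0, beta2=3.0, seed=1):
    """Missingness depends on both Trt and the omitted variable (LEP status).

    All parameter values are hypothetical and chosen for illustration.
    """
    rng = random.Random(seed)
    # Assumed response probabilities keyed by (trt, lep).
    resp = {(1, 1): 0.6, (0, 1): 0.3, (1, 0): 0.9, (0, 0): 0.9}
    y_sum = [0.0, 0.0]    # sum of observed outcomes, indexed by Trt
    lep_obs = [0, 0]      # observed LEP students, indexed by Trt
    n_obs = [0, 0]        # observed students, indexed by Trt
    for _ in range(n):
        trt = rng.randint(0, 1)
        lep = 1 if rng.random() < 0.3 else 0
        y = 50.0 + beta1 * lep + beta2 * trt + rng.gauss(0.0, 10.0)
        if rng.random() < resp[(trt, lep)]:
            y_sum[trt] += y
            lep_obs[trt] += lep
            n_obs[trt] += 1
    lep_share_treat = lep_obs[1] / n_obs[1]
    lep_share_control = lep_obs[0] / n_obs[0]
    estimate = y_sum[1] / n_obs[1] - y_sum[0] / n_obs[0]
    return lep_share_treat, lep_share_control, estimate


if __name__ == "__main__":
    print(simulate_differential_missingness())
```

Under these assumptions, LEP students should make up roughly 22 percent of the treatment group's analysis sample and roughly 12.5 percent of the control group's, and the complete-case difference in means should fall below the assumed true impact of 3.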
Mathematically, this introduces omitted variable bias by creating a positive correlation between treatment status (Trt in equation (2)) and the omitted variable (X or LEP status in equation (1)) in the observed sample. Among students with complete data, LEP students are more likely to be in the treatment group than in the control group. If LEP status has a negative effect on the outcome (reading achievement, as measured by the state test), the positive correlation between the treatment and LEP status among students in the analysis sample will produce a negative bias in the impact estimate. Put differently, in this scenario, the RCT will understate the true impact of the treatment. More generally, when there are a variety of omitted variables that are related to both the outcome and its missing data pattern, the bias due to missing data could be positive or negative.
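The direction of the bias can be traced through a short derivation. The notation R (equal to 1 when the outcome is observed) is introduced here for illustration and is not part of equations (1) and (2). The derivation also assumes that, given X and Trt, missingness is unrelated to the error term in equation (1).

```latex
% Assumption (ours): E[\varepsilon \mid X, Trt, R = 1] = 0.
% Conditional mean of the observed outcomes in group t, from equation (1):
\begin{align*}
E[Y \mid Trt = t, R = 1]
  &= \beta_0 + \beta_1\, E[X \mid Trt = t, R = 1] + \beta_2\, t .
\end{align*}
% The complete-case difference in means, which is what equation (2) estimates:
\begin{align*}
\alpha_1^{\text{obs}}
  &= E[Y \mid Trt = 1, R = 1] - E[Y \mid Trt = 0, R = 1] \\
  &= \beta_2 + \beta_1 \bigl( E[X \mid Trt = 1, R = 1] - E[X \mid Trt = 0, R = 1] \bigr).
\end{align*}
```

The second term is the bias. In the example, the term in parentheses is positive (the LEP share is higher in the treatment group's analysis sample) and β1 is negative, so the product is negative. If missingness is unrelated to X within each group, the term in parentheses is zero and the bias vanishes.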
There are two major lessons that can be gleaned from this discussion: