Imputation

In the earlier sections, we suggested that the preferable strategy is to exclude individuals with missing data on any of the study variables from the analysis. This is often referred to as complete case analysis. Such an approach may, however, lead to biased estimates, inadequate power, and inaccurate standard errors.

Different types of imputation

An alternative is to apply imputation. Imputation means replacing missing data with substituted values that are based on existing values in the data. Imputation nonetheless assumes that data are MCAR (or at least MAR), which is perhaps seldom the case.

Types of imputation

Mean/Median: Calculate the mean or median for the variable and impute that value for all individuals who have missing information for that variable. A simple approach, but it cannot be recommended since it introduces so much bias (e.g. it reduces the variance).
Hot deck/cold deck: Randomly (hot deck) or systematically (cold deck) choose a value from an individual in the sample who has similar values on all other study variables. Simple, but restricts the range of possible values to the range among observed values.
Last observation carried forward: Carry forward a value from the last observation for the same individual (works e.g. for repeated measurements). Simple, but reduces the variance. Yields (potentially too) conservative estimates.
Regression: Use the predicted value obtained by regressing the missing variable on other variables. Preserves the relationships between the variables but not the variability around the predicted values.
Stochastic regression: Use the predicted value obtained by regressing the missing variable on other variables, plus a random residual value. Improves on regression imputation by adding a random component.
Extrapolation: Estimate a value from other observations for the same individual (works e.g. for repeated measurements). Might, however, mean that one would estimate values beyond the actual range of data.
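
To make the point about reduced variance concrete, here is a minimal sketch of mean imputation. It is written in Python purely for illustration, and the variable name and its values are made up, not taken from any dataset in this guide. Every missing value is replaced by the same observed mean, so the imputed values add nothing to the spread while the number of observations goes up.

```python
import statistics

# Hypothetical variable with missing values coded as None (made-up numbers).
income = [21.0, 35.0, None, 48.0, 27.0, None, 60.0, 33.0]

# Keep only the observed values and compute their mean.
observed = [v for v in income if v is not None]
mean_observed = statistics.mean(observed)

# Mean imputation: every missing value receives the same observed mean.
imputed = [mean_observed if v is None else v for v in income]

# Sample variance before (observed cases only) and after imputation.
var_observed = statistics.variance(observed)
var_imputed = statistics.variance(imputed)

print(f"Observed mean: {mean_observed:.2f}")
print(f"Variance, observed cases only: {var_observed:.2f}")
print(f"Variance, after mean imputation: {var_imputed:.2f}")
```

The mean is unchanged by construction, but the variance after imputation is smaller than among the observed cases, which is why standard errors based on mean-imputed data tend to be too small.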

Single versus multiple imputation

Single imputation means coming up with one single value for each missing value, which is a simple and therefore quite compelling approach. Unless the data are really MCAR (or at least MAR), single imputation might nevertheless produce bias that is worse than what you would get with a complete case analysis.

The alternative is multiple imputation, which has become a very popular approach. This too assumes that missingness is MCAR or MAR. One starts by creating a number of sets of imputations for the missing values, based on an imputation method with a random component (such as hot deck imputation or stochastic regression imputation). Each completed dataset is then analysed separately, and the results are combined. If performed well, multiple imputation leads to unbiased estimates and accurate standard errors.
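
The text above does not say how the results are combined. A common choice is Rubin's rules, and the sketch below assumes that approach. It is written in Python purely for illustration; the function name `pool_rubin` and the example numbers are hypothetical. The pooled estimate is the average of the estimates from the completed datasets, and the pooled variance adds the average within-imputation variance to the between-imputation variance, inflated by a factor of 1 + 1/m for m imputations.

```python
import math
import statistics


def pool_rubin(estimates, std_errors):
    """Pool one parameter across m completed datasets using Rubin's rules.

    Requires at least two imputations, since the between-imputation
    variance is a sample variance.
    """
    m = len(estimates)
    # Pooled point estimate: the average across the completed datasets.
    pooled = statistics.mean(estimates)
    # Within-imputation variance: the average squared standard error.
    within = statistics.mean([se ** 2 for se in std_errors])
    # Between-imputation variance: sample variance of the estimates.
    between = statistics.variance(estimates)
    # Total variance, with the finite-m correction (1 + 1/m).
    total = within + (1 + 1 / m) * between
    return pooled, math.sqrt(total)


# Hypothetical coefficient and standard error from five completed datasets.
estimates = [0.52, 0.47, 0.55, 0.50, 0.49]
std_errors = [0.10, 0.11, 0.10, 0.12, 0.10]

pooled_estimate, pooled_se = pool_rubin(estimates, std_errors)
print(f"Pooled estimate: {pooled_estimate:.3f}")
print(f"Pooled standard error: {pooled_se:.3f}")
```

The pooled standard error is never smaller than the one based on the within-imputation variance alone, because the between-imputation term carries the extra uncertainty that comes from not knowing the missing values.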

Multiple imputation is not easy, and it requires deep knowledge about the dataset at hand. Therefore, we urge you to think long and hard about whether this is really a good strategy for your analysis. This guide will not cover any practical details about multiple imputation, but feel free to explore it further.

More information
help mi