We will explore three complementary ways of identifying influential observations. Remember that this is only relevant for the continuous x-variables in our model.
| | Explanation | Rule of thumb |
| --- | --- | --- |
| Standardised Pearson residuals | The relative deviations between the observed and fitted values. | Statistic > ±2 |
| Deviance residuals | The difference between the maxima of the observed and the fitted log likelihood functions. | Statistic > ±2 |
| Leverage | How far an observation's x-values deviate from their mean. | Statistic > 3 times the average leverage |

More information: help logistic postestimation
Practical example
The first step is to re-run our multiple regression model. The quietly prefix is added at the beginning of the command to suppress the output.
quietly logistic earlyret bmi sex ib1.educ if pop_logistic==1
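Since quietly suppresses all output, it is handy to know that Stata redisplays the most recent estimation results if you type the estimation command again without arguments:

```stata
* Redisplay the results of the most recent logistic model
logistic
```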
Then we generate a new variable – rstandard – that contains the standardised Pearson residuals.
predict rstandard, rstandard
Next, we generate a scatterplot for rstandard, displaying the id variable on the x-axis. We also include the so-called marker labels (the values of id, in this case), and a horizontal reference line at y=0.
graph twoway scatter rstandard id, mlab(id) yline(0)
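If the scatterplot gets crowded, a quick way to quantify the problem is to count the observations beyond the rule-of-thumb threshold. The snippet below is a sketch using the rstandard variable we just created:

```stata
* Count observations whose standardised Pearson residual lies outside +/-2
count if abs(rstandard) > 2 & rstandard < .

* List their id numbers (output can be long)
list id rstandard if abs(rstandard) > 2 & rstandard < .
```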

Well, we can see that there are plenty of observations with residuals outside ±2.
Let us continue with the deviance residuals. We start with generating a new variable – deviance – which contains the deviance residuals.
predict deviance, deviance
Next, we generate a scatterplot for deviance, displaying the id variable on the x-axis. We also include the so-called marker labels (the values of id, in this case), and a horizontal reference line at y=0.
graph twoway scatter deviance id, mlab(id) yline(0)

This graph, too, shows that many observations have larger deviance residuals than we would like (>2).
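The same counting trick works for the deviance residuals:

```stata
* Count observations whose deviance residual lies outside +/-2
count if abs(deviance) > 2 & deviance < .
```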
Next, we consider the leverage. We begin by generating a new variable – hat – which contains the leverage values.
predict hat, hat
Next, we generate a scatterplot for hat, displaying the id variable on the x-axis. We also include the so-called marker labels (the values of id, in this case), and a horizontal reference line at y=0.
graph twoway scatter hat id, mlab(id) yline(0)

To find out which observations display problematic leverage values, we need to know what the mean leverage is:
mean hat

Mean leverage x 3 (our preferred cut-off value, specified earlier) equals 0.0053772 (i.e. 0.0017924 x 3).
There are some observations with higher values than this, but it is a bit tricky to see how many. Let us explore this further.
sum id if hat>0.0053772 & hat!=.

We thus have 43 observations with leverage values that might be considered too high. To see their id numbers, we can produce another table (output omitted because of its length):
tab id if hat>0.0053772 & hat!=.
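To avoid typing the cut-off by hand (and the rounding errors that come with it), the threshold can also be computed on the fly. A sketch using summarize and its stored result r(mean):

```stata
* Compute mean leverage, then count observations above 3 x the mean
summarize hat, meanonly
display "cut-off = " 3*r(mean)
count if hat > 3*r(mean) & hat < .
```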
So, what should we do with all of this information? There are some additional commands that can be used to explore the importance of each (potentially) influential observation further. However, once again, our advice would be to give up on the continuous version of bmi and use a categorised one instead.
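One such additional diagnostic is Pregibon's delta-beta statistic (available via predict after logistic), which summarises how much the coefficient vector changes when each observation is dropped. Plotting it works just like the graphs above:

```stata
* Pregibon's delta-beta influence statistic
predict dbeta, dbeta
graph twoway scatter dbeta id, mlab(id) yline(0)
```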