Deviance and leverage

We will explore three complementary ways of identifying influential observations. Remember that this is only relevant for the continuous x-variables in our model. 

Standardised Pearson residuals
Explanation: The relative deviations between the observed and fitted values.
Rule of thumb: absolute value > 2

Deviance residuals
Explanation: The difference between the maxima of the observed and the fitted log likelihood functions.
Rule of thumb: absolute value > 2

Leverage
Explanation: How far an observation's x-values deviate from their means.
Rule of thumb: value > 3 times the average leverage
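For reference, the three statistics can be written out as follows (these are the standard textbook definitions for binary logistic regression; the notation is ours and does not appear in the Stata output):

```latex
% Standardised Pearson residual for observation i
r_i^{std} = \frac{y_i - \hat{p}_i}{\sqrt{\hat{p}_i (1 - \hat{p}_i)}\,\sqrt{1 - h_i}}

% Deviance residual (binary outcome, y_i \in \{0, 1\})
d_i = \operatorname{sign}(y_i - \hat{p}_i)\,
      \sqrt{-2\left[\, y_i \ln \hat{p}_i + (1 - y_i) \ln (1 - \hat{p}_i) \,\right]}

% Leverage: diagonal elements h_i of the weighted hat matrix
H = W^{1/2} X \left(X^{\top} W X\right)^{-1} X^{\top} W^{1/2},
\qquad W = \operatorname{diag}\!\left(\hat{p}_i (1 - \hat{p}_i)\right)
```

Note that Stata's logistic postestimation computes these at the level of covariate patterns rather than individual observations, so observations sharing the same x-profile get identical values.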

More information
help logistic postestimation

Practical example

The first step is to re-run our multiple regression model. The quietly prefix at the beginning of the command suppresses the output. 

quietly logistic earlyret bmi sex ib1.educ if pop_logistic==1

Then we generate a new variable – rstandard – that contains the standardised Pearson residuals. 

predict rstandard, rstandard

Next, we generate a scatterplot for rstandard, displaying the id variable on the x-axis. We also include so-called marker labels (the values of id, in this case) and a horizontal reference line at y=0. 

graph twoway scatter rstandard id, mlab(id) yline(0)

We can see that plenty of observations have residuals outside +/-2. 
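To see exactly which observations these are, one option is to list them (a sketch; it assumes the rstandard variable generated above):

```stata
* List observations whose standardised Pearson residuals fall outside +/-2
* (the "& rstandard < ." part excludes missing values)
list id rstandard if abs(rstandard) > 2 & rstandard < .
```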

Let us continue with the deviance residuals. We start with generating a new variable – deviance – which contains the deviance residuals. 

predict deviance, deviance

Next, we generate a scatterplot for deviance, displaying the id variable on the x-axis. We also include so-called marker labels (the values of id, in this case) and a horizontal reference line at y=0. 

graph twoway scatter deviance id, mlab(id) yline(0)

This graph too shows that many observations have deviance residuals outside +/-2 – more than we would like. 
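If we want a quick tally rather than a graph, we can count these observations directly (a sketch; it assumes the deviance variable generated above):

```stata
* Count observations with deviance residuals outside +/-2,
* excluding missing values
count if abs(deviance) > 2 & deviance < .
```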

Next, we consider the leverage. We begin by generating a new variable – hat – which contains the leverage values. 

predict hat, hat

Next, we generate a scatterplot for hat, displaying the id variable on the x-axis. We also include so-called marker labels (the values of id, in this case) and a horizontal reference line at y=0. 

graph twoway scatter hat id, mlab(id) yline(0)

To identify which observations display problematic leverage values, we need to know the mean leverage: 

mean hat

Mean leverage x 3 (our preferred cut-off value, specified earlier) equals 0.0053772 (i.e. 0.0017924 x 3).  

There are some observations with higher values than this, but it is a bit tricky to see how many. Let us explore this further. 

sum id if hat>0.0053772 & hat!=.

We thus have 43 observations with leverage values that might be considered too high. To see their id numbers, we can request another table (output omitted because of its length): 

tab id if hat>0.0053772 & hat!=.
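As a side note, the cut-off can also be computed from Stata's stored results instead of being typed in by hand, which avoids copying errors (a sketch; the hat variable is the one generated above):

```stata
* Recompute the 3 x mean-leverage cut-off from summarize's stored result
quietly summarize hat
scalar hcut = 3 * r(mean)
count if hat > hcut & hat < .
```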

So, what should we do with all of this information? There are some additional commands that can be used to explore the importance of each (potentially) influential observation further. However, once again, our advice would be to give up on the continuous version of bmi and use a categorised one instead.
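One such additional command is Pregibon's delta-beta statistic, which summarises how much the coefficient vector changes when an observation (strictly, a covariate pattern) is dropped. A sketch, following the same pattern as the plots above:

```stata
* Pregibon's delta-beta influence statistic after logistic
predict dbeta, dbeta
graph twoway scatter dbeta id, mlab(id) yline(0)
```

Observations standing clearly apart from the rest in this plot exert a disproportionate influence on the estimated coefficients.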