The assumptions behind logistic regression are different from those behind linear regression. For example, we do not need to assume a linear effect of the x-variable(s) on y, homoscedastic errors, or normally distributed residuals.
More information: `help logistic postestimation`
Checklist

| Assumption | Description |
| --- | --- |
| Binary outcome | The y-variable has to be binary. Also double-check that the proportion of “cases” (or “non-cases”, for that matter) is not too small. |
| Independence of errors | Observations should be independent, i.e. not derived from a dependent-samples design such as before-after measurements or paired samples. |
| Correct model specification | The model should be correctly specified: the x-variables that are included should be meaningful and contribute to the model, and no important (confounding) variables should be omitted (often referred to as omitted variable bias). |
| Linear relationship | There has to be a linear relationship between any continuous x-variable(s) and the log odds of the y-variable (not the same as the linearity assumed in linear regression). |
| No outliers | Outliers are individuals who do not follow the overall pattern of the data. They are sometimes referred to as influential observations (although not all outliers are influential). This is only relevant for continuous x-variables. |
| No multicollinearity | Multicollinearity may occur when two or more x-variables included simultaneously in the model are strongly correlated with one another. Strictly speaking, this does not violate the assumptions, but it does inflate the standard errors, which makes it harder to reject the null hypothesis. |
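One common way to probe the linearity-in-the-logit assumption for a continuous predictor is the Box-Tidwell approach: add the term x·ln(x) to the model and test its coefficient. A rough sketch in Stata, assuming a hypothetical outcome `y` and a strictly positive continuous predictor `x1` (adjust the names to your data):

```stata
* Box-Tidwell-style check for linearity in the logit (hypothetical names)
* Note: requires x1 > 0, since ln(x1) is undefined otherwise
gen x1lnx1 = x1 * ln(x1)
logit y x1 x1lnx1
```

If the coefficient on `x1lnx1` is statistically significant, that suggests the relationship between `x1` and the log odds of `y` is not linear, and a transformation of `x1` may be warranted.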
Most importantly, the model should fit the data. There are several tests to determine “goodness of fit” or, put differently, if the estimated model (i.e. the model with one or more x-variables) predicts the outcome better than the null model (i.e. a model without any x-variables).
Before going into any specific tests, we need to address the issues of “sensitivity” and “specificity”. By comparing the cases and non-cases predicted by the model with the cases and non-cases actually present in the outcome, we can draw a conclusion about the proportion of correctly predicted cases (sensitivity) and the proportion of correctly classified non-cases (specificity).
Sensitivity and specificity
| | Estimated: Non-case | Estimated: Case |
| --- | --- | --- |
| “Truth”: Non-case | True negative | False positive |
| “Truth”: Case | False negative | True positive |
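In Stata, a classification table of this kind, together with sensitivity and specificity, can be obtained after estimating the model. A minimal sketch, assuming a hypothetical binary outcome `y` and predictors `x1` and `x2`:

```stata
* Hypothetical model -- replace y, x1, x2 with your own variables
logit y x1 x2

* Classification table: sensitivity, specificity, and the share
* of correctly classified observations (default cutoff 0.5)
estat classification
```

By default the cutoff for classifying an observation as a “case” is a predicted probability of 0.5; this can be changed with the `cutoff()` option.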
A general comment about model fit: if the main interest was to identify the best model to predict a certain outcome, that would solely guide which x-variables we put into the analysis. For example, we would exclude x-variables that do not contribute to the model’s predictive ability. However, research is typically guided by theory and by the interest of examining associations between variables. If we thus have good theoretical reasons for keeping a certain x-variable or sticking to a certain model, we should most likely do that (but still, the model should not fit the data horribly). Model diagnostics will then be a way of showing others the potential problems with the model we use.
Types of model diagnostics
| Diagnostic | Purpose |
| --- | --- |
| Link test | Assess model specification |
| Box-Tidwell and exponential regression models | Check for linearity |
| Deviance and leverage | Check for influential observations |
| Correlation matrix | Check for multicollinearity |
| The Hosmer and Lemeshow test | Assess goodness of fit |
| ROC curve | Assess goodness of fit |
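Most of the diagnostics above map onto standard Stata postestimation commands. A hedged sketch, again assuming a hypothetical outcome `y` and predictors `x1`, `x2`, and `x3`:

```stata
* Hypothetical model -- replace the variable names with your own
logit y x1 x2 x3

linktest                // model specification: _hatsq should be non-significant
estat gof, group(10)    // Hosmer and Lemeshow goodness-of-fit test
lroc                    // ROC curve and area under the curve

predict dev, deviance   // deviance residuals (influential observations)
predict lev, hat        // leverage

correlate x1 x2 x3      // correlation matrix (multicollinearity)
```

Large deviance residuals or high-leverage observations can be inspected by listing or plotting `dev` and `lev` against the observation identifier, and strongly correlated pairs in the correlation matrix flag potential multicollinearity.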