The assumptions behind logistic regression are different from those behind linear regression. For example, we do not need to assume a linear effect of the x-variable(s) on y, homoscedastic errors, or normally distributed residuals.
More information: `help logistic postestimation`
Checklist

| Assumption | Description |
| --- | --- |
| Binary outcome | The y-variable has to be binary. Also double-check that the proportion of “cases” (or “non-cases”, for that matter) is not too small. |
| Independence of errors | Observations should be independent, i.e. not derived from a dependent-samples design such as before-after measurements or paired samples. |
| Correct model specification | The model should be correctly specified: the x-variables that are included should be meaningful and contribute to the model, and no important (confounding) variables should be omitted (often referred to as omitted variable bias). |
| Linear relationship | There has to be a linear relationship between any continuous x-variable(s) and the log odds of the y-variable (not the same as the linearity assumed in linear regression). |
| No outliers | Outliers are individuals who do not follow the overall pattern of the data. They are sometimes referred to as influential observations (although not all outliers are influential). This is only relevant for continuous x-variables. |
| No multicollinearity | Multicollinearity may occur when two or more x-variables included simultaneously in the model are strongly correlated with one another. Strictly speaking, this does not violate the assumptions, but it does inflate the standard errors, which makes it harder to reject the null hypothesis. |
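One common way to probe the linearity-in-the-logit assumption for a continuous predictor is the Box-Tidwell approach: add the term x·ln(x) to the model and test its coefficient. A rough sketch in Stata, assuming a hypothetical outcome `y` and a strictly positive continuous predictor `x1` (adjust the names to your data):

```stata
* Box-Tidwell-style check for linearity in the logit (hypothetical names)
* Note: requires x1 > 0, since ln(x1) is undefined otherwise
gen x1lnx1 = x1 * ln(x1)
logit y x1 x1lnx1
```

If the coefficient on `x1lnx1` is statistically significant, that suggests the relationship between `x1` and the log odds of `y` is not linear, and a transformation of `x1` may be warranted.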
Most importantly, the model should fit the data. There are several tests to determine “goodness of fit” or, put differently, if the estimated model (i.e. the model with one or more x-variables) predicts the outcome better than the null model (i.e. a model without any x-variables).
Before going into any specific tests, we need to address the issues of “sensitivity” and “specificity”. By comparing the cases and non-cases predicted by the model with the cases and non-cases actually present in the outcome, we can draw a conclusion about the proportion of correctly predicted cases (sensitivity) and the proportion of correctly classified non-cases (specificity).
Sensitivity and specificity
| | Estimated: Non-case | Estimated: Case |
| --- | --- | --- |
| “Truth”: Non-case | True negative | False positive |
| “Truth”: Case | False negative | True positive |
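In Stata, a classification table of this kind, together with sensitivity and specificity, can be obtained after estimating the model. A minimal sketch, assuming a hypothetical binary outcome `y` and predictors `x1` and `x2`:

```stata
* Hypothetical model -- replace y, x1, x2 with your own variables
logit y x1 x2

* Classification table: sensitivity, specificity, and the share
* of correctly classified observations (default cutoff 0.5)
estat classification
```

By default the cutoff for classifying an observation as a “case” is a predicted probability of 0.5; this can be changed with the `cutoff()` option.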
A general comment about model fit: if the main interest was to identify the best model to predict a certain outcome, that would solely guide which x-variables we put into the analysis. For example, we would exclude x-variables that do not contribute to the model’s predictive ability. However, research is typically guided by theory and by the interest of examining associations between variables. If we thus have good theoretical reasons for keeping a certain x-variable or sticking to a certain model, we should most likely do that (but still, the model should not fit the data horribly). Model diagnostics will then be a way of showing others the potential problems with the model we use.
Types of model diagnostics
| Diagnostic | Purpose |
| --- | --- |
| Link test | Assess model specification |
| Box-Tidwell and exponential regression models | Check for linearity |
| Deviance and leverage | Check for influential observations |
| Correlation matrix | Check for multicollinearity |
| The Hosmer and Lemeshow test | Assess goodness of fit |
| ROC curve | Assess goodness of fit |
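Most of the diagnostics above map onto standard Stata postestimation commands. A hedged sketch, again assuming a hypothetical outcome `y` and predictors `x1`, `x2`, and `x3`:

```stata
* Hypothetical model -- replace the variable names with your own
logit y x1 x2 x3

linktest                // model specification: _hatsq should be non-significant
estat gof, group(10)    // Hosmer and Lemeshow goodness-of-fit test
lroc                    // ROC curve and area under the curve

predict dev, deviance   // deviance residuals (influential observations)
predict lev, hat        // leverage

correlate x1 x2 x3      // correlation matrix (multicollinearity)
```

Large deviance residuals or high-leverage observations can be inspected by listing or plotting `dev` and `lev` against the observation identifier, and strongly correlated pairs in the correlation matrix flag potential multicollinearity.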