Poisson regression in short

If you have only one x, it is called simple regression, and if you have more than one x, it is called multiple regression.

Regardless of whether you are doing a simple or a multiple regression, x-variables can be categorical (nominal/ordinal) and/or continuous (ratio/interval).

Key information from Poisson regression

Effect
  Incidence rate ratio (IRR): The exponent of the log incidence rate
  Log incidence rate: The logarithm of the incidence rate
  Incidence rate: The rate at which events occur

Direction
  Negative: IRR below 1
  Positive: IRR above 1

Statistical significance
  P-value:
    p<0.05: Statistically significant at the 5% level
    p<0.01: Statistically significant at the 1% level
    p<0.001: Statistically significant at the 0.1% level
  95% Confidence intervals:
    Interval does not include 1: Statistically significant at the 5% level
    Interval includes 1: Statistically non-significant at the 5% level
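
As a minimal sketch of how these quantities appear in Stata (y and x are hypothetical variable names, with y a count outcome), the model can be fitted with or without the irr option:

    * Coefficients are reported on the log scale
    poisson y x

    * The irr option exponentiates the coefficients into incidence rate ratios
    poisson y x, irr

    * Equivalently, exponentiate the coefficient of x by hand
    display exp(_b[x])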

Incidence rate ratio (IRR)

In Poisson regression analysis, the effect that x has on y is reflected by an incidence rate ratio (IRR):

IRR below 1: For every unit increase in x, the incidence rate of y decreases.
IRR above 1: For every unit increase in x, the incidence rate of y increases.

Exactly how the IRR should be interpreted in plain language depends on the measurement scale of the x-variable. That is why we will present examples later for continuous, binary, and categorical (non-binary) x-variables.

Note
Unlike linear regression, where the null value (i.e. value that denotes no difference) is 0, the null value for Poisson regression is 1.
Note
An IRR can never be negative – it can range between 0 and infinity.

How not to interpret incidence rate ratios

The incidence rate ratios produced by Poisson regression analysis are not the same as risk ratios (see Attributable proportion). IRRs tend to be inflated when they are above 1 and understated when they are below 1, and this becomes more problematic the more common the outcome is (i.e. the more “non-zeros” we have). However, the rarer the outcome (<10% is usually considered a reasonable cut-off here), the closer incidence rate ratios and risk ratios become.

Many find it tempting to interpret IRRs in terms of percentages. For example, an IRR of 1.20 might lead to the interpretation that the incidence rate of the outcome increases by 20%. If the IRR is 0.80, some would then suggest that the incidence rate decreases by 20%. We urge you to reflect carefully on the latter kind of interpretation, since incidence rate ratios are not symmetrical: above 1 they can take any value up to infinity, whereas below 1 they are bounded by 0. Thus, the choice of reference category might lead to quite misleading conclusions about effect size. The percentage interpretation is usually considered reasonable when IRRs are below 2. If they are above 2, it is better to refer to “times”, i.e. an IRR of 4.07 could be interpreted as “more than four times the incidence rate of…”.
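
To see the asymmetry with a made-up figure: an IRR of 0.80 for group A relative to group B corresponds to an IRR of 1/0.80 = 1.25 for group B relative to group A, so the very same contrast reads as a 20% lower rate in one direction but a 25% higher rate in the other. In Stata, this can be checked with a quick calculation:

    * Reversing the reference category inverts the IRR: 1/0.80 = 1.25
    display 1/0.80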

Take home message

It is completely fine to discuss the results more generally in terms of higher or lower incidence rates/risks. However, if you want to give exact numbers to exemplify, you need to consider the asymmetry of incidence rate ratios as well as the size of the IRR.

P-values and confidence intervals

In Poisson regression analysis you can get information about statistical significance, in terms of both p-values and confidence intervals (also see P-values).

Note
The p-values and the confidence intervals will give you partly different information, but they are not contradictory. If the p-value is below 0.05, the 95% confidence interval will not include 1 and, if the p-value is above 0.05, the 95% confidence interval will include 1.

When you look at the p-value, you can rather easily distinguish between the significance levels (i.e. you can directly say whether you have statistical significance at the 5% level, the 1% level, or the 0.1% level).

When it comes to confidence intervals, Stata by default reports 95% confidence intervals. It is, however, possible to change the confidence level; for example, you may instruct Stata to show 99% confidence intervals instead.
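
For example (with the hypothetical variables y, x1 and x2), the level() option controls the confidence level that Stata reports:

    * Default: IRRs with 95% confidence intervals
    poisson y x1 x2, irr

    * Request 99% confidence intervals instead
    poisson y x1 x2, irr level(99)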

R-Squared

R-Squared (or R2) does not work very well due to the assumptions behind Poisson regression. Stata produces a pseudo R2, but due to inherent bias this is seldom used.

Simple versus multiple regression models

The difference between simple and multiple regression models is that in a multiple regression each x-variable’s effect on y is estimated while accounting for the other x-variables’ effects on y. We then say that these other x-variables are “held constant”, or “adjusted for”, or “controlled for”. Because of this, multiple regression analysis is a way of dealing with the issue of confounding variables, and to some extent also mediating variables (see Z: confounding, mediating and moderating variables).

It is highly advisable to run a simple regression for each of the x-variables before including them in a multiple regression. Otherwise, you will have nothing to compare the adjusted estimates against (i.e. you will not be able to see what happened to the coefficients when the other x-variables were included in the analysis). Including multiple x-variables in the same model usually (but not always) means that their effects become weaker, which would of course be expected if the x-variables overlap in their effect on y.
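
A minimal sketch of this workflow in Stata, again using the hypothetical variables y, x1 and x2:

    * Simple (unadjusted) models: one for each x-variable
    poisson y x1, irr
    poisson y x2, irr

    * Multiple (adjusted) model: each x-variable is held constant for the other
    poisson y x1 x2, irr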

A note  

Remember that a regression analysis should follow from theory as well as a comprehensive set of descriptive statistics and knowledge about the data. In the following sections, we will – for the sake of simplicity – not form any elaborate analytical strategy where we distinguish between x-variables and z-variables (see Z: confounding, mediating and moderating variables). However, we will define an analytical sample and use a so-called pop variable (see From study sample to analytical sample).