Cox regression in short

Written by:

Ylva B Almquist

If you have only one x, it is called simple regression, and if you have more than one x, it is called multiple regression. Regardless of whether you are doing a simple or a multiple regression, the x-variables can be categorical (nominal/ordinal) and/or continuous (ratio/interval).

Key information from Cox regression

Effect
Hazard ratio (HR)	The exponentiated regression coefficient, i.e. the ratio between two hazard rates
Hazard rate	The probability that the event, if it has not yet occurred, will occur in the next time interval, divided by the length of that interval.
Direction
Negative	HR below 1
Positive	HR above 1
Statistical significance
P-value	p<0.05: Statistically significant at the 5% level; p<0.01: Statistically significant at the 1% level; p<0.001: Statistically significant at the 0.1% level
95% Confidence intervals	Interval does not include 1: Statistically significant at the 5% level; Interval includes 1: Statistically non-significant at the 5% level
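
To make the definition of the hazard rate in the table above more concrete, here is a minimal sketch in Python. The counts, the interval length, and the function name `hazard_rate` are made up for illustration. Note that Cox regression does not compute the hazard ratio in this crude way; the sketch only shows what the quantities mean.

```python
def hazard_rate(events: int, at_risk: int, interval_length: float) -> float:
    """Approximate hazard rate over one time interval.

    events / at_risk is the probability of the event among those who have
    not yet experienced it; dividing by the interval length turns that
    probability into a rate per unit of time.
    """
    conditional_probability = events / at_risk
    return conditional_probability / interval_length


# Hypothetical counts for two groups, each followed over half a year.
rate_unexposed = hazard_rate(events=10, at_risk=200, interval_length=0.5)
rate_exposed = hazard_rate(events=15, at_risk=200, interval_length=0.5)

print(f"unexposed: {rate_unexposed:.2f} per year")
print(f"exposed:   {rate_exposed:.2f} per year")
print(f"ratio of the two rates: {rate_exposed / rate_unexposed:.2f}")
```

A ratio above 1 means a higher hazard rate in the exposed group, and a ratio below 1 means a lower one.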

Hazard ratio (HR)

In Cox regression analysis, the effect that x has on y is reflected by a hazard ratio (HR):

HR below 1: For every unit increase in x, the hazard rate of y decreases.

HR above 1: For every unit increase in x, the hazard rate of y increases.

Exactly how one interprets the HR in plain writing depends on the measurement scale of the x-variable. That is why we will present examples later for continuous, binary, and categorical (non-binary) x-variables.

Note
Unlike linear regression, where the null value (i.e. value that denotes no difference) is 0, the null value for Cox regression is 1.

Note
An HR can never be negative – it can range between 0 and infinity (∞).
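
The two notes above follow from the fact that a hazard ratio is an exponentiated regression coefficient. The sketch below, using made-up coefficients and hypothetical helper names, illustrates why a coefficient of 0 corresponds to an HR of 1 and why an HR cannot be negative.

```python
import math


def hazard_ratio(coef: float) -> float:
    """Convert a Cox regression coefficient (log hazard ratio) to an HR."""
    return math.exp(coef)


def direction(hr: float) -> str:
    """Describe the direction of the association implied by an HR."""
    if hr > 1:
        return "positive"
    if hr < 1:
        return "negative"
    return "no association"


# Hypothetical coefficients: negative, zero (the null value), and positive.
for coef in (-0.5, 0.0, 0.5):
    hr = hazard_ratio(coef)
    print(f"coefficient {coef:+.1f} -> HR {hr:.2f} ({direction(hr)})")
```

However large and negative the coefficient is, the exponential stays above 0, which is why the HR is bounded below by 0 but unbounded above.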

How not to interpret hazard ratios

The hazard ratios produced with Cox regression analysis are not the same as risk ratios (see Attributable proportion). Compared with risk ratios, HRs tend to lie further from 1: they are larger than the risk ratio when above 1 and smaller than the risk ratio when below 1. This becomes more problematic the more common the outcome is (i.e. the more “cases” we have). However, the rarer the outcome is (<10% is usually considered a reasonable cut-off here), the closer hazard ratios and risk ratios become.
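
The gap between hazard ratios and risk ratios can be illustrated with a small calculation. The sketch below assumes proportional hazards, under which the probability of remaining event-free in the exposed group equals that of the reference group raised to the power of the HR. The baseline risks and the function name `risk_ratio` are chosen purely for illustration.

```python
def risk_ratio(hr: float, baseline_risk: float) -> float:
    """Risk ratio implied by a hazard ratio under proportional hazards.

    Uses S1 = S0 ** HR, where S is the probability of remaining event-free
    by the end of follow-up and risk = 1 - S.
    """
    baseline_survival = 1.0 - baseline_risk
    exposed_risk = 1.0 - baseline_survival ** hr
    return exposed_risk / baseline_risk


# Hypothetical baseline risks: a rare, a moderately rare, and a common outcome.
for baseline_risk in (0.01, 0.10, 0.50):
    rr_high = risk_ratio(2.0, baseline_risk)
    rr_low = risk_ratio(0.5, baseline_risk)
    print(
        f"baseline risk {baseline_risk:.0%}: "
        f"HR 2.00 -> RR {rr_high:.2f}, HR 0.50 -> RR {rr_low:.2f}"
    )
```

With a rare outcome, the risk ratio is close to the hazard ratio. With a common outcome, the risk ratio moves noticeably towards 1.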

Many would find it tempting to interpret HRs in terms of percentages. For example, an HR of 1.20 might lead to the interpretation that the hazard rate of the outcome increases by 20%. If the HR is 0.80, some would then suggest that the hazard rate decreases by 20%. We would urge you to reflect carefully upon the latter kind of interpretation, since hazard ratios are not symmetrical: they can take any value above 1 but cannot go below 0. Thus, the choice of reference category might lead to quite misleading conclusions about effect size. The former kind of interpretation is usually considered reasonable when HRs are below 2. If they are above 2, it is better to refer to “times”, i.e. an HR of 4.07 could be interpreted as “more than four times the hazard rate of…”.
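
The asymmetry can be seen by switching the reference category of a binary x-variable, which turns the HR into its reciprocal. The sketch below is only an illustration: the helper names are hypothetical, and the cut-off at 2 simply follows the rule of thumb described above.

```python
def flip_reference(hr: float) -> float:
    """HR for a binary x-variable after switching the reference category."""
    return 1.0 / hr


def describe(hr: float) -> str:
    """Phrase an HR as a percentage (below 2) or as 'times' (2 or above)."""
    if hr == 1:
        return "no difference in hazard rate"
    if hr >= 2:
        return f"{hr:.2f} times the hazard rate"
    change = (hr - 1.0) * 100
    word = "higher" if change > 0 else "lower"
    return f"{abs(change):.0f}% {word} hazard rate"


hr = 0.80
print(describe(hr))                  # the same contrast, seen from one side
print(describe(flip_reference(hr)))  # and from the other side
print(describe(4.07))
```

An HR of 0.80 reads as a 20% lower hazard rate, but the very same contrast with the categories swapped gives an HR of 1.25, i.e. a 25% higher hazard rate.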

Take home messages

Do not interpret hazard ratios as risk ratios, unless the outcome is very rare (<10%, but even then, be careful). It is completely fine to discuss the results more generally in terms of higher or lower hazard rates/risks. However, if you want to give exact numbers to exemplify, you need to consider the asymmetry of hazard ratios as well as the size of the HR.

P-values and confidence intervals

In Cox regression analysis you can get information about statistical significance, in terms of both p-values and confidence intervals (also see P-values).

Note
The p-values and the confidence intervals will give you partly different information, but they are not contradictory. If the p-value is below 0.05, the 95% confidence interval will not include 1 and, if the p-value is above 0.05, the 95% confidence interval will include 1.

When you look at the p-value, you can rather easily distinguish between the significance levels (i.e. you can directly say whether you have statistical significance at the 5% level, the 1% level, or the 0.1% level).

When it comes to confidence intervals, Stata will by default show 95% confidence intervals. It is, however, possible to change the confidence level. For example, you may instruct Stata to show 99% confidence intervals instead.
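
The link between p-values and confidence intervals at different levels can be sketched with a small calculation. The example below assumes the usual normal-approximation (Wald) approach, in which the interval is computed on the coefficient scale and then exponentiated. The coefficient, the standard error, and the function name `wald_summary` are made up for illustration and are not taken from any Stata output.

```python
import math
from statistics import NormalDist


def wald_summary(coef: float, se: float, level: float = 0.95):
    """Return (HR, lower limit, upper limit, p-value) for a coefficient."""
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)  # e.g. about 1.96 for 95%
    z = coef / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))    # two-sided p-value
    lower = math.exp(coef - z_crit * se)
    upper = math.exp(coef + z_crit * se)
    return math.exp(coef), lower, upper, p_value


# Hypothetical coefficient (log hazard ratio) and standard error.
coef, se = 0.25, 0.11

for level in (0.95, 0.99):
    hr, lower, upper, p_value = wald_summary(coef, se, level)
    print(
        f"{level:.0%} CI: HR {hr:.2f} ({lower:.2f}-{upper:.2f}), "
        f"p = {p_value:.3f}"
    )
```

In this made-up example the p-value lies between 0.01 and 0.05. Accordingly, the 95% interval excludes 1 while the wider 99% interval includes it.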

R-squared

R-Squared (or R2) does not work very well due to the assumptions behind Cox regression. Stata produces a pseudo R2, but due to inherent bias this is seldom used.

Simple versus multiple regression models

The difference between simple and multiple regression models is that in a multiple regression each x-variable’s effect on y is estimated while accounting for the other x-variables’ effects on y. We then say that these other x-variables are “held constant”, or “adjusted for”, or “controlled for”. Because of this, multiple regression analysis is a way of dealing with the issue of confounding variables, and to some extent also mediating variables (see Z: confounding, mediating and moderating variables). It is highly advisable to run a simple regression for each of the x-variables before including them in a multiple regression. Otherwise, you will not have anything to compare the adjusted coefficients with (i.e. what happened to the coefficients when other x-variables were included in the analysis). Including multiple x-variables in the same model usually (but not always) means that their associations with y become weaker, which would of course be expected if the x-variables overlapped in their effect on y.

A note

Remember that a regression analysis should follow from theory as well as a comprehensive set of descriptive statistics and knowledge about the data. In the following sections, we will – for the sake of simplicity – not form any elaborate analytical strategy where we distinguish between x-variables and z-variables (see Z: confounding, mediating and moderating variables). However, we will define an analytical sample and use a so-called pop variable (see From study sample to analytical sample).