Introduction

Written by:

Ylva B Almquist

Logistic regression is used when y is categorical with only two categories, i.e. dichotomous/binary (see Measurement scales).

Cases and non-cases

A logistic regression is based on the fact that the outcome has only two possible values: 0 or 1. Often, the value 1 is used to denote a “case” whereas the value 0 is then a “non-case”. What is meant by case or non-case depends on how the hypothesis is formulated.

Example
a. We want to investigate the association between educational attainment (x) and employment (y). Our hypothesis is that educational attainment is positively associated with employment (i.e. higher educational attainment = more likely to be employed).
Coding of employment: 0=Unemployment (non-case); 1=Employment (case).

b. We want to investigate the association between educational attainment (x) and unemployment (y). Our hypothesis is that educational attainment is negatively associated with unemployment (i.e. higher educational attainment = less likely to be unemployed).
Coding of unemployment: 0=Employment (non-case); 1=Unemployment (case).
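
To make the two codings concrete, here is a minimal Python sketch. The status labels and variable names are made up for illustration. The point is only that the same raw data give two mirror-image 0/1 outcomes, depending on which category we treat as the case.

```python
# Hypothetical raw data: employment status for five individuals.
status = ["employed", "unemployed", "employed", "employed", "unemployed"]

# Example a: employment is the case (1), unemployment the non-case (0).
employment = [1 if s == "employed" else 0 for s in status]

# Example b: unemployment is the case (1), employment the non-case (0).
unemployment = [1 - e for e in employment]

print(employment)    # [1, 0, 1, 1, 0]
print(unemployment)  # [0, 1, 0, 0, 1]
```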

Logistic regression is used to predict the odds of being a case, compared to not being a case, based on the values of x. We get a coefficient – called log odds – that shows the effect of x on y. The log odds are the natural logarithm of the odds. These coefficients are not easy to interpret. Instead, we usually focus on something called the odds ratio (OR). The OR is calculated by taking the exponent of the coefficient. This part is further explained below.

Why not use linear regression?

With linear regression, we model the mean outcome. When we have a binary outcome, the mean is a probability.

What is a probability?

The extent to which an event is likely to occur. Or, if we stick to the terminology presented earlier: the extent to which the outcome is likely to be a case.
If the probability of the outcome being a case is p, then the probability of the outcome being a non-case is 1-p.
The formula can be expressed as: p(case)=number of cases/(number of cases+number of non-cases).
Probabilities always range between 0 and 1.

Example
We have a sample of 10 individuals, of which 3 are diagnosed with depression (cases), and 7 are not (non-cases). The probability of depression in the sample is thus 3/10=0.3 (can also be expressed as percentages, which would be 30%).

Moreover, 5 of the individuals are men, of which 1 is a case and 4 are non-cases. The remaining 5 individuals are women, of which 2 are cases and 3 are non-cases. The probability of depression among men is thus 1/5=0.2 (20%) whereas the probability of depression among women is 2/5=0.4 (40%).
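
The probabilities in this example can be reproduced with a few lines of Python. This is only a sketch: the dictionary layout and the function name are our own choices, and the counts are the ones given in the example.

```python
# Counts taken from the example: (cases, non-cases) per group.
counts = {"men": (1, 4), "women": (2, 3)}


def probability(cases, non_cases):
    """p(case) = cases / (cases + non-cases)."""
    return cases / (cases + non_cases)


total_cases = sum(c for c, _ in counts.values())
total_non_cases = sum(n for _, n in counts.values())

p_all = probability(total_cases, total_non_cases)
p_men = probability(*counts["men"])
p_women = probability(*counts["women"])

print(p_all, p_men, p_women)  # 0.3 0.2 0.4
```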

If we were to fit a linear regression to a binary outcome, we could well get predicted values that are outside of the range of probabilities (i.e. below 0 and/or above 1). See for example the figure below, where a binary outcome is modelled together with a continuous x-variable, using linear regression.

Apart from this, applying a linear regression to a binary outcome will violate several of the other assumptions of linear regression analysis (normally distributed and homoscedastic residuals).
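
The problem with predicted values can be illustrated with a small Python sketch. The ten data points below are invented for illustration, and the least squares line is computed by hand for a single x-variable.

```python
# Hypothetical data: a continuous x and a binary outcome y.
x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Ordinary least squares with one predictor: slope b and intercept a.
b = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sum(
    (xi - mean_x) ** 2 for xi in x
)
a = mean_y - b * mean_x

# Predicted values from the fitted line.
predicted = [a + b * xi for xi in x]

# Keep only the predictions that are not valid probabilities.
outside = [round(p, 3) for p in predicted if p < 0 or p > 1]
print(outside)
```

With these data, the fitted line gives predictions below 0 for the lowest values of x and above 1 for the highest values of x.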

The logistic function

Instead, we can apply a generalised linear model (GLM). A GLM uses a link function, so that a transformation of the expected outcome, rather than the expected outcome itself, varies linearly with the x-variable(s). For logistic regression, the link function is the logit, which is the inverse of the logistic function. Through this, we restrict the predicted probabilities to vary between 0 and 1. See for example the figure below, where a binary outcome is modelled together with a continuous x-variable, using logistic regression.

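But how does the logistic function work? Well, the model works on the log odds scale: probabilities are transformed to log odds, the log odds are modelled as a linear function of x, and the logistic function converts the log odds back to probabilities. The coefficients are estimated using maximum likelihood estimation. For logistic regression, the maximum likelihood method is the equivalent of ordinary least squares (OLS) in linear regression.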
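
The two transformations can be sketched in Python as follows. The function names logit and logistic are our own choices for this illustration. The sketch shows that the logistic function returns a value between 0 and 1, however small or large the log odds are.

```python
import math


def logit(p):
    """Probability -> log odds."""
    return math.log(p / (1 - p))


def logistic(z):
    """Log odds -> probability (the inverse of the logit)."""
    return 1 / (1 + math.exp(-z))


# Any value on the log odds scale is mapped to a value between 0 and 1.
for z in (-10, -1, 0, 1, 10):
    print(z, round(logistic(z), 4))

# Going from probability to log odds and back returns the probability.
print(round(logistic(logit(0.3)), 4))  # 0.3
```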

What are odds?

The probability that the outcome will be a case, divided by the probability that the outcome will be a non-case.
Can take any value from zero to infinity.
If the probability of the outcome being a case is p, then the odds of the outcome being a case are p/(1-p).

Example
In our sample, the probability of having been diagnosed with depression is 0.3. This means that the odds of depression are 0.3/(1-0.3)=0.4286 (rounded value).

For men, the probability of having been diagnosed with depression is 0.2. Their odds of depression are thus 0.2/(1-0.2)=0.25. For women, the probability of having been diagnosed with depression is 0.4. Their odds of depression are thus 0.4/(1-0.4)=0.6667 (rounded value).
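
The same odds can be computed in Python. This is a minimal sketch; the function name odds is our own choice.

```python
def odds(p):
    """Odds of being a case, given the probability p of being a case."""
    return p / (1 - p)


print(round(odds(0.3), 4))  # 0.4286 (whole sample)
print(round(odds(0.2), 4))  # 0.25   (men)
print(round(odds(0.4), 4))  # 0.6667 (women)
```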

What are log odds?

The natural logarithm of the odds (the logarithm is the power to which a base, here the number e, must be raised in order to produce some other number).
Also referred to as the logit of the probability.
Can take any value.
Is symmetric around zero.
Estimated as: log(p/(1-p)).

Example
In our sample, the odds of depression are 0.4286. This corresponds to log odds of -0.8473. For men, the odds of depression are 0.25, which means that the log odds are -1.3863. The odds among women are 0.6667 and their log odds are thus -0.4055. That is a difference of 0.9808 between men and women (women have 0.9808 higher log odds than men). All values are rounded.
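
The log odds can be computed from the probabilities with Python's math.log, which is the natural logarithm. In this sketch, the function name log_odds is our own choice. Because the probabilities are not rounded before the logarithm is taken, the fourth decimal can differ slightly from values calculated from rounded odds.

```python
import math


def log_odds(p):
    """Natural logarithm of the odds p / (1 - p)."""
    return math.log(p / (1 - p))


lo_all = log_odds(0.3)
lo_men = log_odds(0.2)
lo_women = log_odds(0.4)

print(round(lo_all, 4))             # -0.8473
print(round(lo_men, 4))             # -1.3863
print(round(lo_women, 4))           # -0.4055
print(round(lo_women - lo_men, 4))  # 0.9808
```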

So far, so good! But as we previously mentioned, the log odds are not easy to interpret. That is why it is very common to convert them to odds ratios.

What is an odds ratio (OR)?

The exponent of a difference in log odds, which is the same as the ratio between two odds (the exponent is a special way of expressing repeated multiplications).
Can take any value from zero to infinity.
Estimated as: exp(difference in log odds).

Example
A difference in log odds of 0.9808 between men and women corresponds to an OR of 2.67 (rounded value). Thus, women have 2.67 times the odds of depression compared to men.
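
The OR in this example can be obtained in two equivalent ways: by dividing the odds among women by the odds among men, or by taking the exponent of the difference in log odds. The Python sketch below shows both, using the probabilities from the example. The variable names are our own.

```python
import math

odds_men = 0.2 / (1 - 0.2)
odds_women = 0.4 / (1 - 0.4)

# Way 1: the ratio between the two odds.
or_from_odds = odds_women / odds_men

# Way 2: the exponent of the difference in log odds.
or_from_log_odds = math.exp(math.log(odds_women) - math.log(odds_men))

print(round(or_from_odds, 2))      # 2.67
print(round(or_from_log_odds, 2))  # 2.67
```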

Before we continue, let us revisit the mathematical expression for linear regression:

y=a+bx+e

y (or rather y hat; ŷ) is the predicted value of y.
a is the intercept (or constant), i.e. the value of y when x=0.
b is the slope (steepness) of the regression line, i.e. how much y changes per unit increase in x.
x is the value of x.
e is the error term (or residual), i.e. the error in predicting the value of y given the value of x.

For logistic regression, the formula is:

log(p/(1-p))=a+bx+e

log(p/(1-p)) is the natural logarithm of the odds, i.e. of the probability that the outcome will be a case, divided by the probability that the outcome will be a non-case.
a is the intercept (or constant), i.e. the log odds of y when x=0.
b is the change in log odds per unit increase in x. Can be transformed to odds ratio by taking the exponent of b. Can be transformed back to log odds by taking the log of the odds ratio.
x is the value of x.
e is the error term (or residual), i.e. the error in predicting the probability of y given the value of x.
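
To see how a and b relate to the depression example, the sketch below estimates them by maximum likelihood in plain Python. It assumes a hand-written Newton-Raphson routine, chosen here for illustration only and not taken from any particular statistical software. The coding of x (0 for men, 1 for women) is also our own choice.

```python
import math

# Data from the depression example: x = 0 for men, 1 for women; y = 1 for cases.
x = [0] * 5 + [1] * 5
y = [1, 0, 0, 0, 0] + [1, 1, 0, 0, 0]

a, b = 0.0, 0.0  # starting values for the intercept and the coefficient

for _ in range(25):
    # Predicted probabilities at the current estimates.
    p = [1 / (1 + math.exp(-(a + b * xi))) for xi in x]

    # Gradient of the log-likelihood with respect to a and b.
    g_a = sum(yi - pi for yi, pi in zip(y, p))
    g_b = sum((yi - pi) * xi for xi, yi, pi in zip(x, y, p))

    # Information matrix (negative Hessian), a symmetric 2x2 matrix.
    w = [pi * (1 - pi) for pi in p]
    h_aa = sum(w)
    h_ab = sum(wi * xi for wi, xi in zip(w, x))
    h_bb = sum(wi * xi * xi for wi, xi in zip(w, x))

    # Newton-Raphson step: solve the 2x2 system directly.
    det = h_aa * h_bb - h_ab * h_ab
    a += (h_bb * g_a - h_ab * g_b) / det
    b += (h_aa * g_b - h_ab * g_a) / det

print(round(a, 4))            # -1.3863 (log odds among men, where x=0)
print(round(b, 4))            # 0.9808  (difference in log odds, women vs. men)
print(round(math.exp(b), 2))  # 2.67    (odds ratio)
```

The intercept equals the log odds of depression among men, and the exponent of b equals the OR calculated by hand in the earlier example.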

Other names for logistic regression

We have chosen to use the term logistic regression when we refer to binary logistic regression or binomial regression (in reality, ordinal regression and multinomial regression are also types of logistic regression, see Ordinal regression & Multinomial regression). Other names for this type of regression model are, e.g., logit regression and generalised linear model (GLM) with logit link function.