Linear regression in short

Written by:

Ylva B Almquist

If you have only one x-variable, the analysis is called simple regression; if you have more than one x-variable, it is called multiple regression.

Regardless of whether you are doing a simple or a multiple regression, x-variables can be categorical (nominal/ordinal) and/or continuous (ratio/interval).

Key information from linear regression

Effect
B coefficient (B)	The change in y, per unit increase in x
Direction
Negative	B below 0
Positive	B above 0
Statistical significance
P-value	p<0.05: Statistically significant at the 5% level
	p<0.01: Statistically significant at the 1% level
	p<0.001: Statistically significant at the 0.1% level
95% Confidence intervals	Interval does not include 0: Statistically significant at the 5% level
	Interval includes 0: Statistically non-significant at the 5% level

B coefficient (B)

In linear regression analysis, the effect that x has on y is reflected by a B coefficient (B):

Negative B coefficient	For every unit increase in x, y decreases by [B].
Positive B coefficient	For every unit increase in x, y increases by [B].

Exactly how one interprets the B coefficient in plain writing depends on the measurement scale of the x-variable. That is why we will present examples later for continuous, binary, and categorical (non-binary) x-variables.

Note
What the B coefficient actually stands for depends on how x and y are measured and coded (i.e. which values they can take and in which units).
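To make the "change in y per unit increase in x" reading concrete, here is a minimal sketch in plain Python (not Stata syntax). The function name `simple_ols` and the education/wage figures are made up for illustration only. It also shows why the unit of x matters: the same data with x expressed in months instead of years gives a B that is 12 times smaller.

```python
from statistics import mean

def simple_ols(x, y):
    """Return (intercept, B) for the least-squares line y = intercept + B*x."""
    x_bar, y_bar = mean(x), mean(y)
    # Sum of cross-products and sum of squares around the means
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sxy / sxx
    return y_bar - b * x_bar, b

# Hypothetical data: years of education (x) and hourly wage (y)
education = [9, 10, 12, 12, 14, 16]
wage = [100, 110, 125, 135, 150, 170]

intercept, b = simple_ols(education, wage)
print(f"B per year of education: {b:.2f}")

# Same data with x measured in months: one unit of x is now a month, not a year
_, b_months = simple_ols([12 * e for e in education], wage)
print(f"B per month of education: {b_months:.2f}")
```

Both coefficients describe the same association; only the meaning of "one unit of x" differs.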

P-values and confidence intervals

In linear regression analysis you can get information about statistical significance, in terms of both p-values and confidence intervals.

Note
The p-values and the confidence intervals will give you partly different information, but they are not contradictory. If the p-value is below 0.05, the 95% confidence interval will not include 0 and, if the p-value is above 0.05, the 95% confidence interval will include 0.

When you look at the p-value, you can rather easily distinguish between the significance levels (i.e. you can directly say whether you have statistical significance at the 5% level, the 1% level, or the 0.1% level).

When it comes to confidence intervals, Stata shows 95% confidence intervals by default. It is, however, possible to change the confidence level. For example, you may instruct Stata to show 99% confidence intervals instead.
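The link between p-values and confidence intervals can be sketched with a small calculation. The example below is plain Python, not Stata, and it assumes a normal approximation for simplicity (regression output is normally based on the t distribution, so exact figures will differ somewhat in small samples). The function name `z_test_and_ci` and the values B = 0.50 and standard error = 0.20 are hypothetical.

```python
from statistics import NormalDist

def z_test_and_ci(b, se, level=0.95):
    """Two-sided p-value and confidence interval for B (normal approximation)."""
    std_normal = NormalDist()
    z = b / se
    p_value = 2 * (1 - std_normal.cdf(abs(z)))
    # Critical value for the chosen confidence level, e.g. about 1.96 for 95%
    critical = std_normal.inv_cdf(1 - (1 - level) / 2)
    return p_value, (b - critical * se, b + critical * se)

# Hypothetical estimate: B = 0.50 with a standard error of 0.20
p, ci95 = z_test_and_ci(0.50, 0.20)
_, ci99 = z_test_and_ci(0.50, 0.20, level=0.99)

print(f"p-value: {p:.4f}")
print(f"95% CI: {ci95[0]:.3f} to {ci95[1]:.3f}")
print(f"99% CI: {ci99[0]:.3f} to {ci99[1]:.3f}")
```

In this made-up case the p-value lies between 0.01 and 0.05, so the 95% interval excludes 0 while the wider 99% interval includes it: significant at the 5% level but not at the 1% level.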

For more information about statistical significance, see Statistical significance.

R-Squared

You also get information about something called R-Squared, or R². This term refers to the proportion of the variance in y that is explained by the x-variable(s) included in the model. The R² value ranges between 0 and 1; a higher value means a larger share of explained variance. Generally speaking, the higher the R² value, the better the model fits the data at hand (i.e. the closer the model's predictions are to the observed values of y).
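As an illustration of what R-Squared measures, the sketch below computes it as one minus the ratio of unexplained to total variation in y. It is plain Python with made-up numbers; the function name `r_squared` is an assumption for this example, and the fitted values come from the least-squares line y = 2.2 + 0.6x for x = 1, 2, 3, 4, 5.

```python
from statistics import mean

def r_squared(y, y_hat):
    """Share of the variation in y that is explained by the fitted values."""
    y_bar = mean(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # unexplained
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)               # total
    return 1 - ss_res / ss_tot

# Hypothetical observed y-values and the values predicted by the fitted line
y = [2, 4, 5, 4, 5]
fitted = [2.8, 3.4, 4.0, 4.6, 5.2]

r2 = r_squared(y, fitted)
print(f"R-Squared: {r2:.2f}")
```

Predictions that match y exactly would give a value of 1, while always predicting the mean of y would give 0.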

Simple versus multiple regression models

The difference between simple and multiple regression models is that, in a multiple regression, each x-variable’s effect on y is estimated while accounting for the other x-variables’ effects on y. We then say that these other x-variables are “held constant”, “adjusted for”, or “controlled for”. Because of this, multiple regression analysis is a way of dealing with the issue of confounding variables, and to some extent also mediating variables (see Z: confounding, mediating and moderating variables).

It is highly advisable to run a simple regression for each of the x-variables before including them in a multiple regression. Otherwise, you will not have anything to compare the adjusted coefficients with (i.e. you cannot see what happened to the coefficients when other x-variables were included in the analysis). Including multiple x-variables in the same model usually (but not always) means that their coefficients become weaker, which would of course be expected if the x-variables overlap in their effect on y.
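The comparison between an unadjusted and an adjusted coefficient can be sketched as follows. This is plain Python rather than Stata, with made-up data constructed so that y = 1 + 2*x1 + 3*x2 exactly and x1 and x2 are positively correlated. The helper names are assumptions, and the adjusted coefficient is obtained by "partialling out" x2 from both x1 and y, which gives the same coefficient for x1 as a multiple regression of y on x1 and x2.

```python
from statistics import mean

def slope(x, y):
    """B coefficient from a simple regression of y on x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx

def residuals(x, y):
    """What is left of y after removing its linear association with x."""
    b = slope(x, y)
    a = mean(y) - b * mean(x)
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]

def adjusted_slope(x1, x2, y):
    """B for x1, adjusted for x2, by removing x2 from both x1 and y first."""
    return slope(residuals(x2, x1), residuals(x2, y))

# Hypothetical data: x1 and x2 overlap, and both contribute to y
x1 = [1, 2, 3, 4, 5, 6]
x2 = [1, 1, 2, 2, 3, 4]
y = [1 + 2 * a + 3 * b for a, b in zip(x1, x2)]

crude = slope(x1, y)
adjusted = adjusted_slope(x1, x2, y)
print(f"Simple regression B for x1: {crude:.2f}")
print(f"B for x1 adjusted for x2:   {adjusted:.2f}")
```

Here the simple regression attributes part of x2's contribution to x1, so the coefficient for x1 shrinks once x2 is adjusted for.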

A note

Remember that a regression analysis should always follow from theory as well as from a comprehensive set of descriptive statistics and knowledge about the data. In the following sections, we will, for the sake of simplicity, not form any elaborate analytical strategy where we distinguish between x-variables and z-variables (see X, Y, and Z). However, we will define an analytical sample and use a so-called pop variable (see From study sample to analytical sample).