Multiple linear regression

Written by:

Ylva B Almquist

Quick facts

Number of variables
One dependent (y)
At least two independent (x)

Scales of variable(s)
Dependent: continuous (ratio/interval)
Independent: categorical (nominal/ordinal) and/or continuous (ratio/interval)

Theoretical example

Example
Suppose we are interested in whether having young children (x), residential area (x), and income (x) are related to the number of furry pets in a household (y).

Having young children is measured as either 0=No young children or 1=Young children. Residential area has the values 1=Metropolitan, 2=Smaller city, and 3=Rural. We choose Metropolitan as our reference category. Income is measured as the yearly household income from salary in thousands of SEK (ranges between 100 and 700). The number of furry pets is measured as the number of cats, dogs or other furry animals living in the household, and ranges between 0 and 10.

We get a B coefficient of 0.51 for having young children. That means that those who have young children have, on average, about half a furry pet more than those who do not. This association is adjusted for residential area and income.

With regard to residential area, the B coefficient for Smaller city is 2.02 whereas the B coefficient for Rural is 4.99. That suggests, firstly, that the number of furry pets is higher (about two more pets, on average) among individuals living in smaller cities compared to metropolitan areas. Secondly, the number of furry pets is much higher (almost five more pets, on average) among individuals living in rural areas compared to metropolitan areas. These associations are adjusted for having young children and income.

Finally, the B coefficient for income is -0.1. This suggests that for every unit increase in income (i.e. for every additional one thousand SEK), the number of furry pets decreases by 0.1, on average. This association is adjusted for having young children and residential area.
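As an illustration, the coefficients above can be collected into a single regression equation. The intercept is not reported in this example, so it is left as the symbol a. The names children, smaller city and rural are labels introduced here for the 0/1 indicators, with Metropolitan as the reference category.

```latex
\hat{y} = a + 0.51\,\text{children} + 2.02\,\text{smaller city} + 4.99\,\text{rural} - 0.1\,\text{income}

% Two households with the same income:
% rural with young children vs. metropolitan without young children
\hat{y}_{\text{rural, children}} - \hat{y}_{\text{metropolitan, no children}} = 0.51 + 4.99 = 5.50
```

Because the two households have the same income, the income term cancels, and the predicted difference is simply the sum of the two relevant B coefficients.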

Practical example

Dataset

StataData1.dta

Variable name	gpa
Variable label	Grade point average (Age 15, Year 1985)
Value labels	N/A

Variable name	cognitive
Variable label	Cognitive test score (Age 15, Year 1985)
Value labels	N/A

Variable name	bullied
Variable label	Exposure to bullying (Age 15, Year 1985)
Value labels	0=No 1=Yes

Variable name	skipped
Variable label	Skipped class (Age 15, Year 1985)
Value labels	1=Never 2=Sometimes 3=Often

sum gpa cognitive bullied skipped if pop_linear==1

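In this model, we have three x-variables: cognitive, bullied, and skipped. When we put them together, their associations with gpa are mutually adjusted. The prefix ib1. in front of skipped tells Stata to treat the variable as categorical, with the value 1 (Never) as the reference category.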

reg gpa cognitive bullied ib1.skipped if pop_linear==1

In the simple regression models, we had R-squared values of 0.3835 (for cognitive), 0.0100 (for bullied), and 0.0350 (for skipped). Now that we have a multiple regression analysis, it is better to look at the adjusted R-squared, which takes the number of x-variables into account. In this case it is 0.4186, which means that about 42% of the variance in gpa is explained by our three x-variables.
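The R-squared values can also be retrieved directly instead of being read off the output. The following is a minimal sketch, assuming that StataData1.dta is already loaded; it relies on the results that regress leaves behind in e().

```stata
* Assumes StataData1.dta is already in memory
quietly regress gpa cognitive bullied ib1.skipped if pop_linear==1

* regress stores its results in e(): r2 is the plain and
* r2_a the adjusted R-squared, N the number of observations
display "R-squared:          " %6.4f e(r2)
display "Adjusted R-squared: " %6.4f e(r2_a)
display "Observations:       " e(N)
```

The adjusted value is never higher than the plain R-squared, since it applies a penalty for each x-variable added to the model.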

When it comes to the B coefficients, they are roughly the same or somewhat weaker (i.e. closer to 0) in comparison to the simple regression models. For example, the B coefficient for cognitive is still 0.006. The B coefficient for bullied is weaker: -0.07 here instead of -0.23. Concerning the categories of skipped, we see that the B coefficient for Sometimes is still -0.18 and the B coefficient for Often is -0.37 instead of -0.38.

The associations between the x-variables and gpa are still statistically significant (p<0.05) after mutual adjustment.

Summary
In the fully adjusted model, it can be observed that the associations with grade point average are not altered in any substantial way in comparison to the simple models. To conclude, cognitive test scores, exposure to bullying, and having skipped class are associated with grade point average at a statistically significant level (all: p<0.001, shown by Stata as 0.000). Nonetheless, the associations are generally rather weak.

Estimates table and coefficients plot

If we have multiple models, we can facilitate comparisons between them by asking Stata to construct estimates tables and coefficients plots. We run the regression models one by one, save the estimates after each with estimates store, and then use the commands estimates table and coefplot.

The coefplot command is not part of the standard Stata program, so unless you have already added this package, you need to install it:

ssc install coefplot

As an example, we can include the three simple regression models as well as the multiple regression model. The quietly prefix is placed at the beginning of each regression command to suppress the output.

Run and save the first simple regression model:

quietly reg gpa cognitive if pop_linear==1

estimates store model1

Run and save the second simple regression model:

quietly reg gpa bullied if pop_linear==1

estimates store model2

Run and save the third simple regression model:

quietly reg gpa ib1.skipped if pop_linear==1

estimates store model3

Run and save the multiple regression model:

quietly reg gpa cognitive bullied ib1.skipped if pop_linear==1

estimates store model4

Produce the estimates table:

estimates table model1 model2 model3 model4
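By default, the table only shows the B coefficients. As a sketch of one possible variant, assuming that model1 to model4 have been stored as above, the options below add standard errors, fix the number of decimals, and append the number of observations and the adjusted R-squared for each model.

```stata
* Assumes model1-model4 were saved with estimates store
* b() and se() set the display format of coefficients and standard errors
* stats() adds stored scalars from e(): N and the adjusted R-squared
estimates table model1 model2 model3 model4, b(%9.3f) se(%9.3f) stats(N r2_a)
```

Keeping the adjusted R-squared in the same table makes it easier to see how much the multiple regression model adds compared to each simple model.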

Produce the coefficients plot:

coefplot model1 model2 model3 model4

Note
You can improve the graph by using the Graph Editor to delete “_cons” as well as to adjust the category and label names.
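The same adjustments can also be made in the command itself, which keeps the graph reproducible. This is a minimal sketch, assuming that coefplot is installed and that model1 to model4 are stored; the label texts are only suggestions, and the /// line continuation requires that the command is run from a do-file.

```stata
* Assumes coefplot is installed and model1-model4 are stored
* drop(_cons) leaves out the intercept; xline(0) marks "no association"
* coeflabels() replaces coefficient names with readable labels (suggested texts)
coefplot model1 model2 model3 model4, drop(_cons) xline(0) ///
    coeflabels(cognitive = "Cognitive test score" ///
               bullied = "Exposed to bullying" ///
               2.skipped = "Skipped class: Sometimes" ///
               3.skipped = "Skipped class: Often")
```

Note that the coefficients are shown on the same axis, so a variable with a very small B coefficient, such as cognitive, may be hard to distinguish from 0 in the plot.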