Simple linear regression with a categorical (non-binary) x

Theoretical examples

Example 1
We want to investigate the association between educational attainment (x) and income (y). Educational attainment has three values: 1=Compulsory, 2=Upper secondary, and 3=University. We choose Compulsory as our reference category. Income is measured in thousands of Swedish kronor (SEK) per month and ranges between 20 and 40. Let us say that we get a B coefficient of 2.1 for Upper secondary and a B coefficient of 3.4 for University. In other words, those with upper secondary education have, on average, a monthly income that is 2100 SEK higher than those with compulsory education, and those with university education have, on average, a monthly income that is 3400 SEK higher than those with compulsory education.
Example 2
Suppose we are interested in the association between family type (x) and children’s average school marks (y). Family type has three categories: 1=Two-parent household, 2=Joint custody, and 3=Single-parent household. We choose Two-parent household as our reference category. Children’s average school marks range from 1 to 5. The analysis results in a B coefficient of -0.1 for joint custody and a B coefficient of -0.9 for single-parent household. That would mean that children living in joint custody families have a 0.1 point lower score for average school marks compared to those living in two-parent households. Moreover, children living in single-parent households have a 0.9 point lower score for average school marks compared to those living in two-parent households.
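
The interpretation in Example 1 can be written out as a regression equation. This is only a sketch under assumed notation that does not appear in the text above: a is the intercept (the average income in the reference category, whose value is not given in the example), and D_upper and D_univ are dummy variables equal to 1 for the respective category and 0 otherwise.

```latex
\begin{align*}
\hat{y} &= a + 2.1\,D_{\text{upper}} + 3.4\,D_{\text{univ}} \\
\text{Compulsory:}\quad \hat{y} &= a \\
\text{Upper secondary:}\quad \hat{y} &= a + 2.1 \\
\text{University:}\quad \hat{y} &= a + 3.4
\end{align*}
```

Each B coefficient is thus a difference from the reference category. The difference between University and Upper secondary follows by subtraction: 3.4 - 2.1 = 1.3, i.e. 1300 SEK per month.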

Practical example

Dataset
StataData1.dta
Variable name: gpa
Variable label: Grade point average (Age 15, Year 1985)
Value labels: N/A
Variable name: skipped
Variable label: Skipped class (Age 15, Year 1985)
Value labels: 1=Never, 2=Sometimes, 3=Often

sum gpa skipped if pop_linear==1

The variable skipped has three categories: 1=Never, 2=Sometimes, and 3=Often.
Here, we use the prefix ib1. to specify that the first category (Never) will be the reference category.

reg gpa ib1.skipped if pop_linear==1

R-squared is 0.04. Thus, skipped only explains 4% of the variance in gpa.

With regard to the B coefficients, we get two: one for skipped: Sometimes and one for skipped: Often. Both are compared to the reference category skipped: Never. The reference category in linear regression always has a B coefficient of 0.00. In this case we can see that the B coefficient for Sometimes is -0.18, and for Often it is -0.38. Put differently, the more often individuals have skipped class, the lower their grade point average.

Both Sometimes and Often have p-values below 0.05 (displayed as 0.000 in the output, i.e. p<0.001), and the 95% confidence intervals are -0.21 to -0.15 and -0.42 to -0.33, respectively. Thus, there is a statistically significant difference in gpa between Sometimes and Never, and between Often and Never.
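
To see what the reference-category coding in the regression above amounts to, the same model can be fitted with manually created dummy variables. This is a minimal sketch, assuming that skipped takes only the values 1, 2, and 3; the stub name skip_ is a hypothetical choice and must not clash with existing variable names.

```stata
* Create one dummy variable per category: skip_1, skip_2, skip_3
tabulate skipped, generate(skip_)

* Leave out skip_1 (Never) so that it becomes the reference category
regress gpa skip_2 skip_3 if pop_linear==1

* Re-run the original model so that later postestimation commands refer to it
regress gpa ib1.skipped if pop_linear==1
```

The regression on skip_2 and skip_3 should return the same B coefficients as the factor-variable version, since leaving out the dummy for Never is what makes it the reference category.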

Test the overall effect

The output presented and interpreted above is based on the coefficients for the dummy variables of skipped. But what about the overall statistical effect of skipped on gpa? We can assess it with contrast, which is a postestimation command.

contrast p.skipped, noeffects

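The p. operator requests orthogonal polynomial contrasts, so the output includes a row for a linear trend. Here, we focus on that row, which shows a p-value (P>F) below 0.05. This suggests that we have a statistically significant linear trend in gpa across the categories of skipped.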
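
The linear row tests for a trend. As a complementary sketch, a joint test of whether gpa differs between any of the categories of skipped could be obtained as follows. It is added here for illustration only and assumes that reg gpa ib1.skipped if pop_linear==1 is still the most recent estimation command.

```stata
* Joint test that the coefficients for Sometimes and Often are both zero
testparm i.skipped

* Joint test of skipped via contrast (without the p. operator)
contrast skipped
```

Both commands test all categories of skipped at once, so they should lead to the same conclusion about the overall effect.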

More information
help contrast

We will also produce a graph of the trend. First, however, we need to apply the postestimation command margins.

Note
The margins command can also be used for variables that are continuous or binary, but it is particularly useful for categorical, non-binary (e.g. ordinal) variables.

margins skipped

Note that the estimate for Never in the column Margin is identical to the constant from the linear regression analysis (3.348289). Adding the B coefficient for Sometimes (-0.1792257) to the constant gives the estimate for Sometimes in this table (3.169063). Likewise, adding the B coefficient for Often (-0.376323) to the constant gives the estimate for Often in this table (2.971966).
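
The additions described above can be reproduced from the stored regression results. This is a minimal sketch, assuming that reg gpa ib1.skipped if pop_linear==1 is the most recent estimation command (margins without the post option leaves those results in place).

```stata
* Predicted gpa for Never: the constant
display _b[_cons]

* Predicted gpa for Sometimes: constant + B coefficient for Sometimes
display _b[_cons] + _b[2.skipped]

* Predicted gpa for Often: constant + B coefficient for Often
display _b[_cons] + _b[3.skipped]

* lincom computes the same sum and adds a confidence interval
lincom _cons + 3.skipped
```

The displayed values should match the Margin column, which shows that margins reports predicted values of gpa for each category rather than differences from the reference category.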

marginsplot

Note
The y-axis shows predicted values (i.e. not B coefficients).
More information
help marginsplot

Summary
Among 15-year-olds, there is a negative and statistically significant association between having skipped class and grade point average. The association is graded: compared to those who never skipped class, those who skipped class sometimes have a lower grade point average (B=-0.18, 95% CI=-0.21 to -0.15) and those who skipped class often have an even lower grade point average (B=-0.38, 95% CI=-0.42 to -0.33).