Insert the name of the x-variable(s) that you want to use.
Command
regress
Short name
reg
More information
help regress
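For example, a minimal regression command looks like this (yvar, xvar1 and xvar2 are placeholder names, not variables in any real dataset):

```
regress yvar xvar1 xvar2
```

Using the short name, reg yvar xvar1 xvar2 gives identical results.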
A walk-through of the output
When we perform a linear regression in Stata, the table looks like this:
In this example, yvar ranges between 0 and 10, whereas xvar1 is a binary (0/1) variable and xvar2 is a continuous variable ranging between 100 and 500.
The upper left part of the table is an ANOVA table which shows distribution of variance. This is what the different columns mean:
Source
The Total variance is partitioned into Model and Residual. The former is the variance that can be explained by the Model, i.e. the x-variable(s) that we include. The latter is the variance which cannot be explained by the model.
SS
The sum of squares (SS) associated with the sources of variance.
Df
The degrees of freedom (df) associated with the sources of variance.
MS
The mean squares (MS), which is the sum of squares divided by the degrees of freedom.
The upper right part shows the overall model fit. This is what the different rows mean:
Number of obs
The number of observations included in the model.
F
F-value, calculated as the mean square model divided by the mean square residual.
Prob > F
The p-value associated with the F-value. If the p-value is below 0.05, the x-variable(s), taken together, predict the y-variable better than a model with no predictors (at the 5% level).
R-squared
The proportion of variance in the y-var that can be explained by the x-variable(s).
Adj R-squared
Same as R-squared, but adjusted for the number of x-variables in the model; it penalizes the inclusion of x-variables that do not improve the model.
Root MSE
Root mean square error (RMSE). This can be seen as a measure of accuracy (the lower the RMSE, the smaller the errors, i.e. the better the predictive power).
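The relationships described above can be checked from Stata's stored results after regress (a sketch with placeholder variable names; the e() results are listed under help regress):

```
regress yvar xvar1 xvar2

* F-value: mean square model divided by mean square residual
display (e(mss)/e(df_m)) / (e(rss)/e(df_r))

* R-squared: model sum of squares divided by total sum of squares
display e(mss) / (e(mss) + e(rss))

* Root MSE: square root of the mean square residual
display sqrt(e(rss)/e(df_r))
```

These should reproduce the F, R-squared and Root MSE values shown in the regression table.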
The lower part of the table presents the parameter estimates from the analysis.
The first column lists the y-variable on top, followed by our x-variable(s). The last row represents the constant (intercept).
Coef.
These are the B coefficients, i.e. the expected change in the y-variable for a one-unit increase in the x-variable, holding the other x-variable(s) constant.
Std. Err.
The standard errors associated with the B coefficients.
t
The t-value (the B coefficient divided by its standard error).
P>|t|
The p-value associated with the t-value; a value below 0.05 indicates that the coefficient differs from zero at the 5% level.
[95% Conf. Interval]
95% confidence intervals (lower limit and upper limit).
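As a sketch (again with placeholder names), the t-value and the 95% confidence interval can be reproduced from the stored coefficient and standard error:

```
regress yvar xvar1 xvar2

* t-value: coefficient divided by its standard error
display _b[xvar1] / _se[xvar1]

* 95% confidence interval based on the t-distribution
display _b[xvar1] - invttail(e(df_r), 0.025) * _se[xvar1]
display _b[xvar1] + invttail(e(df_r), 0.025) * _se[xvar1]
```

Here invttail() returns the critical t-value for the residual degrees of freedom, which is what Stata uses for the interval in the output.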
The analytical sample used for the examples
In the subsequent sections, we will use the following variables:
Dataset
StataData1.dta
Variable name
gpa
Variable label
Grade point average (Age 15, Year 1985)
Value labels
N/A
Variable name
cognitive
Variable label
Cognitive test score (Age 15, Year 1985)
Value labels
N/A
Variable name
bullied
Variable label
Exposure to bullying (Age 15, Year 1985)
Value labels
0=No 1=Yes
Variable name
skipped
Variable label
Skipped class (Age 15, Year 1985)
Value labels
1=Never 2=Sometimes 3=Often
sum gpa cognitive bullied skipped
We define our analytical sample through the following command:
gen pop_linear=1 if gpa!=. & cognitive!=. & bullied!=. & skipped!=.
This means that the new variable pop_linear gets the value 1 if none of the four variables has missing information. In this case, 8,136 individuals are included in our analytical sample.
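The analytical sample can then be used to restrict later commands, for example (an illustrative sketch; the i. prefix tells Stata to treat a variable as categorical):

```
* Descriptive statistics for the analytical sample only
sum gpa cognitive bullied skipped if pop_linear==1

* Linear regression restricted to the analytical sample
regress gpa cognitive i.bullied i.skipped if pop_linear==1
```

Restricting every command with if pop_linear==1 ensures that all analyses are based on the same 8,136 individuals.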