Insert the name of the x-variable(s) that you want to use.
Command
regress
Short name
reg
More information
help regress
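For example, a minimal regression command looks like this (yvar, xvar1 and xvar2 are placeholder names, not variables in any real dataset):

```
regress yvar xvar1 xvar2
```

Using the short name, reg yvar xvar1 xvar2 gives identical results.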
A walk-through of the output
When we perform a linear regression in Stata, the table looks like this:
In this example, yvar ranges between 0 and 10, whereas xvar1 is a binary (0/1) variable and xvar2 is a continuous variable ranging between 100 and 500.
The upper left part of the table is an ANOVA table which shows distribution of variance. This is what the different columns mean:
Source
The Total variance is partitioned into Model and Residual. The former is the variance that can be explained by the Model, i.e. the x-variable(s) that we include. The latter is the variance which cannot be explained by the model.
SS
The sum of squares (SS) associated with the sources of variance.
Df
The degrees of freedom (df) associated with the sources of variance.
MS
The mean squares (MS), which is the sum of squares divided by the degrees of freedom.
The upper right part shows the overall model fit. This is what the different rows mean:
Number of obs
The number of observations included in the model.
F
F-value, calculated as the mean square model divided by the mean square residual.
Prob > F
The p-value associated with the F-value. If the p-value is below 0.05, the x-variable(s), taken together, predict the y-variable better than a model with no predictors (at the 5% level).
R-squared
The proportion of variance in the y-var that can be explained by the x-variable(s).
Adj R-squared
Same as R-squared, but adjusted for the number of x-variables in the model; it penalizes the inclusion of x-variables that do not improve the model.
Root MSE
Root mean square error (RMSE). This can be seen as a measure of accuracy (the lower the RMSE, the smaller the errors, i.e. the better the predictive power).
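The relationships described above can be checked from Stata's stored results after regress (a sketch with placeholder variable names; the e() results are listed under help regress):

```
regress yvar xvar1 xvar2

* F-value: mean square model divided by mean square residual
display (e(mss)/e(df_m)) / (e(rss)/e(df_r))

* R-squared: model sum of squares divided by total sum of squares
display e(mss) / (e(mss) + e(rss))

* Root MSE: square root of the mean square residual
display sqrt(e(rss)/e(df_r))
```

These should reproduce the F, R-squared and Root MSE values shown in the regression table.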
The lower part of the table presents the parameter estimates from the analysis.
The first column lists the y-variable on top, followed by our x-variable(s). The last row represents the constant (intercept).
Coef.
These are the B coefficients, i.e. the expected change in the y-variable for a one-unit increase in the x-variable, holding the other x-variable(s) constant.
Std. Err.
The standard errors associated with the B coefficients.
t
The t-value (the B coefficient divided by its standard error).
P>|t|
The p-value associated with the t-value; a value below 0.05 indicates that the coefficient differs from zero at the 5% level.
[95% Conf. Interval]
95% confidence intervals (lower limit and upper limit).
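As a sketch (again with placeholder names), the t-value and the 95% confidence interval can be reproduced from the stored coefficient and standard error:

```
regress yvar xvar1 xvar2

* t-value: coefficient divided by its standard error
display _b[xvar1] / _se[xvar1]

* 95% confidence interval based on the t-distribution
display _b[xvar1] - invttail(e(df_r), 0.025) * _se[xvar1]
display _b[xvar1] + invttail(e(df_r), 0.025) * _se[xvar1]
```

Here invttail() returns the critical t-value for the residual degrees of freedom, which is what Stata uses for the interval in the output.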
The analytical sample used for the examples
In the subsequent sections, we will use the following variables:
Dataset
StataData1.dta
Variable name
gpa
Variable label
Grade point average (Age 15, Year 1985)
Value labels
N/A
Variable name
cognitive
Variable label
Cognitive test score (Age 15, Year 1985)
Value labels
N/A
Variable name
bullied
Variable label
Exposure to bullying (Age 15, Year 1985)
Value labels
0=No 1=Yes
Variable name
skipped
Variable label
Skipped class (Age 15, Year 1985)
Value labels
1=Never 2=Sometimes 3=Often
sum gpa cognitive bullied skipped
We define our analytical sample through the following command:
gen pop_linear=1 if gpa!=. & cognitive!=. & bullied!=. & skipped!=.
This means that the new variable pop_linear gets the value 1 if none of the four variables has missing information. In this case, 8,136 individuals are included in our analytical sample.
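The analytical sample can then be used to restrict later commands, for example (an illustrative sketch; the i. prefix tells Stata to treat a variable as categorical):

```
* Descriptive statistics for the analytical sample only
sum gpa cognitive bullied skipped if pop_linear==1

* Linear regression restricted to the analytical sample
regress gpa cognitive i.bullied i.skipped if pop_linear==1
```

Restricting every command with if pop_linear==1 ensures that all analyses are based on the same 8,136 individuals.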