Standardization: z-scores

The standard score, or z-score, is very useful when we have continuous (ratio/interval) variables with different normal distributions, i.e. different means and standard deviations (see Distributions for more information).

For example, if we have one variable called income (measured as annual household income in Swedish crowns) and another variable called years of schooling (measured as the total number of years spent in the educational system), these variables obviously have very different distributions.

Suppose we want to compare which one, income or years of schooling, has a larger statistical effect on our outcome. That is not possible using the variables as they are, because they are measured in different units (crowns versus years). The solution is to standardize (i.e. calculate z-scores for) these two variables so that they are comparable.

Z-scores are expressed in terms of standard deviations from the mean.

We take a variable and “rescale” it so that it has a mean of 0 and a standard deviation of 1.

Each individual’s value on the standardized variable indicates how far that individual’s original (unstandardized) value lies from the mean, measured in number of standard deviations.

A value of 1.5 would thus suggest that this individual has a value that is 1½ standard deviations above the mean, whereas a value of -2 would suggest that this individual has a value that is 2 standard deviations below the mean.
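
The rescaling described above can be written as a formula. The notation is an assumption made here for illustration and is not part of the original guide: x_i is an individual's value on the original variable, \bar{x} is the mean of that variable, and s is its standard deviation. The worked numbers use a made-up variable with a mean of 50 and a standard deviation of 10.

```latex
z_i = \frac{x_i - \bar{x}}{s}

% Hypothetical example: \bar{x} = 50, s = 10
z = \frac{65 - 50}{10} = 1.5   % 1.5 standard deviations above the mean
z = \frac{30 - 50}{10} = -2    % 2 standard deviations below the mean
```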

Function

Basic command
egen newvarname=std(oldvarname)
Explanations
newvarname: Insert the name of the new variable.
oldvarname: Insert the name of the old variable.
std: The egen function that standardizes the variable (mean 0, standard deviation 1).
More information
help egen

Practical example

Dataset
StataData1.dta
Variable name: gpa
Variable label: Grade point average (Age 15, Year 1985)
Value labels: N/A

Variable name: cognitive
Variable label: Cognitive test score (Age 15, Year 1985)
Value labels: N/A

egen z_gpa=std(gpa)
egen z_cognitive=std(cognitive)

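Now you have new versions of the two variables, containing z-scores. To check that each standardized variable has a mean of 0 and a standard deviation of 1, compare it with the original: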

sum gpa z_gpa cognitive z_cognitive
codebook gpa z_gpa cognitive z_cognitive, compact
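
As a cross-check, the same z-scores can also be computed by hand from the mean and standard deviation that summarize leaves behind in r(mean) and r(sd). This is only a sketch: it assumes StataData1.dta is already in memory, and z_gpa_manual is a hypothetical variable name chosen for illustration.

```stata
* Assumes StataData1.dta is already loaded in memory.
* Store the mean and standard deviation of gpa in r(mean) and r(sd).
quietly summarize gpa

* Subtract the mean and divide by the standard deviation.
* z_gpa_manual is a hypothetical name; observations with missing gpa stay missing.
generate z_gpa_manual = (gpa - r(mean)) / r(sd)

* The new variable should have a mean of (about) 0 and a standard deviation of 1.
summarize z_gpa_manual
```

The result should agree with z_gpa from the egen command up to rounding. In practice egen with std() is the more convenient choice, since it does the same in a single line.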