Correlation analysis

Quick facts

Number of variables
Two or more

Scales of variable(s)
Continuous

A correlation analysis tests the relationship between two continuous variables in terms of: a) how strong the relationship is, and b) in what direction the relationship goes. The strength of the relationship is given as a coefficient (the Pearson product-moment correlation coefficient, or simply Pearson’s r) which can be anything between -1 and 1. But how do we know if the relationship is strong or weak? This is not an exact science, but here is one rule of thumb:

Negative        Positive      Strength
-1              1             Perfect
-0.9 to -0.7    0.7 to 0.9    Strong
-0.6 to -0.4    0.4 to 0.6    Moderate
-0.3 to -0.1    0.1 to 0.3    Weak
0               0             Zero

Thus, the coefficient can be negative or positive. These terms, “negative” and “positive”, are not the same as good and bad (e.g. excellent health or poor health; high income or low income). They merely reflect the direction of the relationship.

Negative: As the values of Variable 1 increase, the values of Variable 2 decrease.
Positive: As the values of Variable 1 increase, the values of Variable 2 increase.
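
For reference, the direction of the relationship can be read off the standard sample formula for Pearson's r. The notation is an assumption introduced here for illustration: x_i and y_i are the observed values of Variable 1 and Variable 2 for individual i, and \bar{x} and \bar{y} are their means.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,
          \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

The denominator is positive whenever both variables vary, so the sign of r comes from the numerator: it is positive when above-average values of one variable tend to occur together with above-average values of the other, and negative when they tend to occur with below-average values.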

Note
Correlation analysis does not imply anything about causality: Variable 1 does not cause Variable 2 (or vice versa). The correlation analysis only says something about the degree to which the two variables co-vary (in a linear fashion).

Assumptions

First, you have to check your data to see that the assumptions behind the correlation analysis hold. If your data meet these assumptions, you can trust the result of the analysis.

Checklist

Two continuous variables: Both variables should be continuous (i.e. interval/ratio). For example: income, height, weight, number of years of schooling, and so on. Although they are not really continuous, it is still rather common to use ratings as continuous variables, such as: “How satisfied with your income are you?” (on a scale 1-10) or “To what extent do you agree with the previous statement?” (on a scale 1-5).
Normal distribution: Both variables need to be approximately normally distributed. Use a histogram to check (see Histogram).
Linear relationship between the two variables: There needs to be a linear relationship between your two variables. You can check this by creating a scatterplot (described in Scatterplot).
No outliers: An outlier is an extreme (low or high) value. For example, if most individuals have a test score between 40 and 60, but one individual has a score of 96 or another individual has a score of 1, these values will distort the correlation coefficient.
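
The checklist above could be worked through roughly as follows. This is a minimal sketch, not a complete diagnostic routine. It assumes that StataData1.dta (the dataset used in the practical example) is in the current working directory and uses its variables gpa and cognitive. The graph names h_gpa and h_cog are hypothetical and only there to keep both histograms open at once.

```stata
* Assumption: StataData1.dta is in the current working directory
use StataData1.dta, clear

* Normal distribution: histograms with a normal curve overlaid
histogram gpa, normal name(h_gpa, replace)
histogram cognitive, normal name(h_cog, replace)

* Linear relationship: scatterplot of the two variables
scatter gpa cognitive

* Outliers: detailed summary lists the smallest and largest values
summarize gpa cognitive, detail
```

The plots and the summary output still need to be judged by eye; none of these commands returns a pass/fail verdict.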

Function alternative 1

Basic command
corr varname1 varname2
Explanations
varname1: Insert the name of the first variable you want to use.
varname2: Insert the name of the second variable you want to use.
Short names
corr: correlate
Note
You can include more than two variables at the same time in the analysis.
More information
help correlate

Function alternative 2

Basic command
pwcorr varname1 varname2
Useful options
pwcorr varname1 varname2, sig
pwcorr varname1 varname2, star(level)
Explanations
varname1: Insert the name of the first variable you want to use.
varname2: Insert the name of the second variable you want to use.
sig: Print a p-value for each entry.
star(level): Denote statistically significant entries with an asterisk (*). Change “level” to the preferred significance level (e.g. 0.05, 0.01, 0.001).
Note
Options can be used simultaneously, e.g.:
pwcorr varname1 varname2, sig star(level)
You can include more than two variables at the same time in the analysis.
More information
help pwcorr

There are two alternative commands if you want to do a correlation analysis in Stata: corr and pwcorr. The first difference between these commands has to do with how Stata handles missing values, and is only relevant if you include more than two variables in the analysis. In that case, corr will use listwise deletion (i.e. removing all observations that have missing information on any of the included variables), whereas pwcorr uses pairwise deletion (i.e. for each pair of variables, removing only the observations with missing values on either variable in that pair). The second difference is that they have different options (with the options for pwcorr being slightly more useful).
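
To see the difference between listwise and pairwise deletion, one option is to run both commands on a dataset where one variable has missing values. The sketch below uses the auto dataset that ships with Stata, which is not part of this guide. It is chosen only because its variable rep78 has some missing values.

```stata
* Example dataset shipped with Stata (an assumption, not from this guide)
sysuse auto, clear

* Listwise deletion: every coefficient is based on the same observations,
* namely those with valid values on price, mpg and rep78
corr price mpg rep78

* Pairwise deletion: each coefficient uses all observations that are valid
* for that pair; obs prints the number of observations behind each entry
pwcorr price mpg rep78, obs
```

Comparing the number of observations reported by corr with the counts printed by pwcorr shows which pairs are based on a larger sample under pairwise deletion.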

Since we highly recommend that you restrict your analysis to a sample with only valid information for all study variables anyway, it does not matter whether you choose corr or pwcorr. But since we like the options to include p-values and asterisks, we will base our following example on pwcorr.
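
One possible way to restrict the analysis to observations with valid information on all study variables is to create an indicator first. In this sketch, the variable name complete is hypothetical, and the dataset and variables are those of the practical example below.

```stata
* Assumption: StataData1.dta is in the current working directory
use StataData1.dta, clear

* complete = 1 if both variables are non-missing, 0 otherwise
generate byte complete = !missing(gpa, cognitive)

* Check how many observations are kept and dropped
tabulate complete

* Run the correlation analysis on the complete cases only
pwcorr gpa cognitive if complete == 1, sig star(0.05)
```

With more study variables, the same indicator can be reused in every later command so that all analyses are based on the same sample.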

Practical example

Dataset
StataData1.dta
Variable name: gpa
Variable label: Grade point average (Age 15, Year 1985)
Value labels: N/A

Variable name: cognitive
Variable label: Cognitive test score (Age 15, Year 1985)
Value labels: N/A

pwcorr gpa cognitive, sig star(0.05)

On the diagonal, we can see the perfect (and totally irrelevant) correlations between gpa and gpa, and between cognitive and cognitive. What is interesting here is the correlation coefficient between cognitive and gpa: 0.6276. According to the rule of thumb specified earlier, this is a moderate correlation, close to strong. We get a p-value of 0.0000, which does not mean that the p-value is exactly zero, only that it rounds to zero at four decimals. It is lower than 0.05, as the asterisk also indicates. Thus, the correlation between cognitive test score and grade point average is statistically significant.
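
If a p-value with more decimals than 0.0000 is wanted, it can be recomputed from the results that pwcorr stores. This is a minimal sketch: it assumes StataData1.dta is still in memory, uses the stored results r(rho) and r(N), and applies the usual t test for a correlation coefficient with N - 2 degrees of freedom. The scalar name tstat is hypothetical.

```stata
* Run the analysis without printing the table
quietly pwcorr gpa cognitive

* r(rho) and r(N) hold the coefficient and sample size for the first pair
scalar tstat = r(rho) * sqrt((r(N) - 2) / (1 - r(rho)^2))

* Two-sided p-value from the t distribution with N - 2 degrees of freedom
display "t = " tstat
display "two-sided p = " %10.0g 2 * ttail(r(N) - 2, abs(tstat))
```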