A correlation analysis tests the relationship between two continuous variables in terms of: a) how strong the relationship is, and b) in what direction the relationship goes. The strength of the relationship is given as a coefficient (the Pearson product-moment correlation coefficient, or simply Pearson’s r) which can be anything between -1 and 1. But how do we know if the relationship is strong or weak? This is not an exact science, but here is one rule of thumb:
| Negative | Positive | Strength |
|---|---|---|
| -1 | 1 | Perfect |
| -0.9 to -0.7 | 0.7 to 0.9 | Strong |
| -0.6 to -0.4 | 0.4 to 0.6 | Moderate |
| -0.3 to -0.1 | 0.1 to 0.3 | Weak |
| 0 | 0 | Zero |
Thus, the coefficient can be negative or positive. These terms, “negative” and “positive”, are not the same as good and bad (e.g. excellent health or poor health; high income or low income). They merely reflect the direction of the relationship.
Negative: As the values of Variable 1 increase, the values of Variable 2 decrease.
Positive: As the values of Variable 1 increase, the values of Variable 2 increase.
Note: Correlation analysis does not imply anything about causality. A correlation does not show that Variable 1 causes Variable 2 (or vice versa). It only says something about the degree to which the two variables co-vary (in a linear fashion).
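For reference, the coefficient described above can be written out as a formula. The notation is introduced here for illustration and does not come from the original text: x_i and y_i are the values of Variable 1 and Variable 2 for observation i, and the barred symbols are their sample means.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\;\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

The numerator captures how the two variables co-vary, and the denominator rescales it so that r always falls between -1 and 1.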
Assumptions
First, you have to check your data to see that the assumptions behind the correlation analysis hold. If your data “passes” these assumptions, you will have a valid result.
Checklist
Two continuous variables
Both variables should be continuous (i.e. interval/ratio). For example: Income, height, weight, number of years of schooling, and so on. Although they are not really continuous, it is still rather common to use ratings as continuous variables, such as: “How satisfied with your income are you?” (on a scale 1-10) or “To what extent do you agree with the previous statement?” (on a scale 1-5).
Normal distribution
Both variables need to be approximately normally distributed. Use a histogram to check (see Histogram).
Linear relationship between the two variables
There needs to be a linear relationship between your two variables. You can check this by creating a scatterplot (described in Scatterplot).
No outliers
An outlier is an extreme (low or high) value. For example, if most individuals have a test score between 40 and 60, but one individual has a score of 96 and another has a score of 1, these extreme values can distort the correlation coefficient.
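The checks in the list above could be run roughly as follows. This is a minimal sketch that uses the auto dataset shipped with Stata as a stand-in for your own data; the variables price and mpg and the graph names are only illustrative and should be replaced with your own.

```stata
* Load an example dataset that ships with Stata (stand-in for your own data)
sysuse auto, clear

* Normal distribution: histogram with a normal curve overlaid, one per variable
histogram price, normal name(hist_price, replace)
histogram mpg, normal name(hist_mpg, replace)

* Linear relationship: scatterplot with a fitted straight line
twoway (scatter price mpg) (lfit price mpg), name(scatter_check, replace)

* Outliers: the detailed summary lists the smallest and largest values
summarize price mpg, detail
```

The graphs are judged by eye, so they support the decision about each assumption rather than settle it.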
Function alternative 1
Basic command
corr varname1 varname2
Explanations
varname1
Insert the name of the first variable you want to use.
varname2
Insert the name of the second variable you want to use.
Short names
corr: correlate
Note: You can include more than two variables at the same time in the analysis.
More information: help correlate
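As an illustration of the command above, the following sketch runs corr on the auto dataset shipped with Stata (a stand-in, not the dataset used later in this guide). It also shows that corr leaves the coefficient for the first two variables in r(rho) and the number of observations in r(N).

```stata
* Load an example dataset that ships with Stata
sysuse auto, clear

* Correlation matrix for three variables at once
corr price mpg weight

* With two variables, the coefficient is stored in r(rho)
corr price mpg
display "Pearson's r = " %6.4f r(rho)
display "Observations used = " r(N)
```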
Function alternative 2
Basic command
pwcorr varname1 varname2
Useful options
pwcorr varname1 varname2, sig
pwcorr varname1 varname2, star(level)
Explanations
varname1
Insert the name of the first variable you want to use.
varname2
Insert the name of the second variable you want to use.
sig
Print a p-value for each entry.
star(level)
Denote statistically significant entries with an asterisk (*). Change “level” to the preferred significance level (e.g. 0.05, 0.01, 0.001).
Note: Options can be used simultaneously, e.g.:
pwcorr varname1 varname2, sig star(level)
You can include more than two variables at the same time in the analysis.
More information: help pwcorr
There are two alternative commands if you want to do a correlation analysis in Stata: corr and pwcorr. The first difference between these commands has to do with how Stata handles missing values, and is only relevant if you include more than two variables in the analysis. In that case, corr will use listwise deletion (i.e. removing all observations that have missing information from any of the included variables), whereas pwcorr uses pairwise deletion (i.e. only removing observations with missing values for each specific pair of variables). The second difference is that they have different options (with the options for pwcorr being slightly more useful).
Since we highly recommend that you restrict your analysis to a sample with valid information on all study variables anyway, the two commands will then use the same observations, and it does not matter for the coefficients whether you choose corr or pwcorr. But since we like the options to include p-values and asterisks, we will base the following example on pwcorr.
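The difference between listwise and pairwise deletion can be seen in a small sketch. It assumes the auto dataset shipped with Stata, in which rep78 has missing values for a few cars; the variables are stand-ins for your own study variables.

```stata
* Load an example dataset that ships with Stata
sysuse auto, clear

* corr uses listwise deletion: any observation with a missing value
* on one of the listed variables is left out of every coefficient
corr price mpg rep78
display "corr used N = " r(N)

* pwcorr uses pairwise deletion: obs prints the N behind each coefficient
pwcorr price mpg rep78, obs

* Restrict the sample to observations with valid values on all variables
drop if missing(price, mpg, rep78)

* Both commands now rely on the same observations
corr price mpg rep78
pwcorr price mpg rep78, obs
```

If you prefer not to drop observations from the data in memory, pwcorr also accepts a listwise option that applies the same restriction for that command only.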
Practical example
Dataset: StataData1.dta

| Variable name | Variable label | Value labels |
|---|---|---|
| gpa | Grade point average (Age 15, Year 1985) | N/A |
| cognitive | Cognitive test score (Age 15, Year 1985) | N/A |
pwcorr gpa cognitive, sig star(0.05)
On the diagonal, we see the perfect (and totally irrelevant) correlations of gpa with gpa and of cognitive with cognitive. What is interesting here is the correlation coefficient between cognitive and gpa: 0.6276. According to the rule of thumb specified earlier, this is a moderate correlation (close to strong). The p-value is displayed as 0.0000, which means that it is smaller than 0.00005 after rounding and therefore well below 0.05 (the asterisk indicates the same thing). Thus, the correlation between cognitive test score and grade point average is statistically significant at the 5% level.
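Because the p-value is shown rounded to four decimals, it can be useful to compute it with more precision. The sketch below assumes that StataData1.dta is in the current working directory; the scalar names rho, nobs, tstat and pval are chosen here for illustration. It uses the standard t-test for a correlation coefficient, with n - 2 degrees of freedom.

```stata
* Assumes StataData1.dta is in the current working directory
use StataData1.dta, clear

* corr stores the coefficient in r(rho) and the sample size in r(N)
quietly corr gpa cognitive
scalar rho  = r(rho)
scalar nobs = r(N)

* t statistic for a correlation and its two-sided p-value
scalar tstat = rho * sqrt((nobs - 2) / (1 - rho^2))
scalar pval  = 2 * ttail(nobs - 2, abs(tstat))

display "r = " %6.4f rho "   t = " %8.3f tstat "   p = " %9.2e pval
```

The p-value is printed in scientific notation, so very small values remain visible instead of being rounded to 0.0000.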