Correlation analysis

Written by:

Ylva B Almquist

Quick facts

Number of variables
Two or more

Scales of variable(s)
Continuous

A correlation analysis examines the relationship between two continuous variables in terms of: a) how strong the relationship is, and b) in what direction the relationship goes. The strength of the relationship is given as a coefficient (the Pearson product-moment correlation coefficient, or simply Pearson’s r), which can take any value from -1 to 1. But how do we know if the relationship is strong or weak? This is not an exact science, but here is one rule of thumb:

Negative	Positive	Strength
-1	1	Perfect
-0.9 to -0.7	0.7 to 0.9	Strong
-0.6 to -0.4	0.4 to 0.6	Moderate
-0.3 to -0.1	0.1 to 0.3	Weak
0	0	Zero

Thus, the coefficient can be negative or positive. These terms, “negative” and “positive”, are not the same as good and bad (e.g. excellent health or poor health; high income or low income). They merely reflect the direction of the relationship.

Negative	As the values of Variable 1 increase, the values of Variable 2 decrease
Positive	As the values of Variable 1 increase, the values of Variable 2 increase
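
For reference, the coefficient is conventionally defined as shown below. The notation is an assumption made here for illustration and does not come from the original text: x_i and y_i are the paired observations of Variable 1 and Variable 2, x-bar and y-bar are their means, and n is the number of observations.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

The numerator is positive when the two variables tend to lie on the same side of their means at the same time, and negative when they tend to lie on opposite sides, which is what gives the coefficient its sign.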

Note
Correlation analysis does not imply anything about causality: a correlation does not show that Variable 1 causes Variable 2 (or vice versa). It only says something about the degree to which the two variables co-vary (in a linear fashion).

Assumptions

First, you have to check your data to see that the assumptions behind the correlation analysis hold. If your data meet these assumptions, the result can be considered valid.

Checklist

Two continuous variables	Both variables should be continuous (i.e. interval/ratio). For example: Income, height, weight, number of years of schooling, and so on. Although they are not really continuous, it is still rather common to use ratings as continuous variables, such as: “How satisfied with your income are you?” (on a scale 1-10) or “To what extent do you agree with the previous statement?” (on a scale 1-5).
Normal distribution	Both variables need to be approximately normally distributed. Use a histogram to check (see Histogram).
Linear relationship between the two variables	There needs to be a linear relationship between your two variables. You can check this by creating a scatterplot (described in Scatterplot).
No outliers	An outlier is an extreme (low or high) value. For example, if most individuals have a test score between 40 and 60, but one individual has a score of 96 and another has a score of 1, these values can distort the correlation coefficient.
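
The checklist above could be worked through with a few standard Stata commands. This is a minimal sketch rather than a fixed procedure; varname1 and varname2 are placeholders for your own variables.

```stata
* Placeholder names: replace varname1 and varname2 with your own variables.

* Normal distribution: histogram with a normal density curve overlaid
histogram varname1, normal
histogram varname2, normal

* Linear relationship: scatterplot (first variable on the y-axis)
scatter varname1 varname2

* Outliers: the detailed summary lists the smallest and largest values
summarize varname1 varname2, detail
```

Each graph command replaces the previous graph in the Graph window, so inspect them one at a time.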

Function alternative 1

Basic command

corr varname1 varname2

Explanations
`varname1`	Insert the name of the first variable you want to use.
`varname2`	Insert the name of the second variable you want to use.

Short names
`corr`	Correlate

Note
You can include more than two variables at the same time in the analysis.

More information
help correlate

Function alternative 2

Basic command

pwcorr varname1 varname2

Useful options

pwcorr varname1 varname2, sig 
pwcorr varname1 varname2, star(level)

Explanations
`varname1`	Insert the name of the first variable you want to use.
`varname2`	Insert the name of the second variable you want to use.
`sig`	Print a p-value for each entry.
`star(level)`	Denote statistically significant entries with an asterisk (*). Change “level” to the preferred significance level (e.g. 0.05, 0.01, 0.001).

Note
Options can be used simultaneously, e.g.:
pwcorr varname1 varname2, sig star(level)
You can include more than two variables at the same time in the analysis.

More information
help pwcorr

There are two alternative commands if you want to do a correlation analysis in Stata: corr and pwcorr. The first difference between these commands has to do with how Stata handles missing values, and is only relevant if you include more than two variables in the analysis. In that case, corr will use listwise deletion (i.e. removing all observations that have missing information from any of the included variables), whereas pwcorr uses pairwise deletion (i.e. only removing observations with missing values for each specific pair of variables). The second difference is that they have different options (with the options for pwcorr being slightly more useful).

Since we highly recommend that you restrict your analysis to a sample with valid information for all study variables anyway, it does not matter whether you choose corr or pwcorr. But since we like the options to include p-values and asterisks, we will base the following example on pwcorr.
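
One possible way to restrict the analysis to observations with valid information on all study variables is sketched below. The variable names are placeholders, and nmiss is a hypothetical helper variable created only for this illustration.

```stata
* Placeholder names: replace varname1-varname3 with your own variables.

* See how many observations each pair is based on before restricting
pwcorr varname1 varname2 varname3, obs

* Count the number of missing values per observation (nmiss is hypothetical)
egen nmiss = rowmiss(varname1 varname2 varname3)

* Run the analysis only on observations with no missing values
pwcorr varname1 varname2 varname3 if nmiss == 0, sig star(0.05)
```

With this restriction in place, every pair of variables is based on the same observations, so pwcorr should use the same sample that corr would.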

Practical example

Dataset

StataData1.dta

Variable name	gpa
Variable label	Grade point average (Age 15, Year 1985)
Value labels	N/A

Variable name	cognitive
Variable label	Cognitive test score (Age 15, Year 1985)
Value labels	N/A

pwcorr gpa cognitive, sig star(0.05)

On the diagonal, we see the perfect (and entirely uninformative) correlations of gpa with gpa and of cognitive with cognitive. What is interesting here is the correlation coefficient between cognitive and gpa: 0.6276. According to the rule of thumb specified earlier, this is a moderate correlation that is close to strong. The p-value is displayed as 0.0000, which means that it is smaller than 0.00005 rather than exactly zero; it is well below 0.05, as the asterisk also indicates. Thus, the correlation between cognitive test score and grade point average is statistically significant.
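
If you want to reuse the coefficient in later calculations or in a report, you can retrieve it from Stata's stored results. The sketch below uses corr, which stores the coefficient for the first two variables in r(rho) and the number of observations in r(N).

```stata
* Run the correlation without displaying the output table
quietly corr gpa cognitive

* Display the stored coefficient and the number of observations used
display "r = " r(rho)
display "N = " r(N)
```

With only two variables, corr and pwcorr use the same observations, so the coefficient displayed here should match the one reported by pwcorr.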