Correlation, Pearson
The Pearson correlation coefficient (also known as the Pearson product-moment correlation coefficient), r, measures the relationship (rather than the difference) between two quantitative variables (interval/ratio) and the degree to which the two variables coincide with one another, that is, the extent to which they are linearly related: changes in one variable correspond to changes in the other. A variety of other correlation coefficients (such as the phi correlation coefficient, point-biserial correlation, Spearman’s rho, partial correlation, and part correlation) have been developed over the years for measuring relationships between sets of data, but the Pearson correlation coefficient (also referred to as Pearson’s r) is the most common measure of correlation and has been widely used in the sciences as a measure of ...
We’re going to consider ‘correlation’ in this chapter. A correlation is a statistical measure of an association between two variables. An association is any relationship between the variables that makes them dependent in some way: knowing the value of one variable gives you information about the possible values of the other. The terms ‘association’ and ‘correlation’ are often used interchangeably, but strictly speaking correlation has a narrower definition. A correlation quantifies, via a measure called a correlation coefficient, the degree to which an association tends to a certain pattern. For example, the correlation coefficient studied below, Pearson’s correlation coefficient, measures the degree to which two variables tend toward a straight-line relationship. There are different methods for quantifying correlation but these all share a number of properties:
We’re going to make sense of all this by studying one particular correlation coefficient in this chapter: Pearson’s product-moment correlation coefficient (\(r\)). Various measures of association exist, so why focus on this one? Once you know how to work with one type of correlation in R it isn’t hard to use another. Pearson’s product-moment correlation coefficient is the best known, which makes it as good a place as any to start learning about correlation analysis.
Pearson’s product-moment correlation
What do we need to know about Pearson’s product-moment correlation? Let’s start with the naming conventions. People often use “Pearson’s correlation coefficient” or “Pearson’s correlation” as a convenient shorthand because writing “Pearson’s product-moment correlation coefficient” all the time soon becomes tedious. If we want to be really concise we use the standard mathematical symbol for Pearson’s correlation coefficient: lower case ‘\(r\)’.
The one thing we absolutely have to know about Pearson’s correlation coefficient is that it is a measure of linear association between numeric variables. This means Pearson’s correlation is appropriate when numeric variables follow a ‘straight-line’ relationship. That doesn’t mean they have to be perfectly related, by the way. It simply means there shouldn’t be any ‘curviness’ to their pattern of association.
Finally, calculating Pearson’s correlation coefficient serves to estimate the strength of an association. An estimate can’t tell us whether that association is likely to be ‘real’ or not. We need a statistical test to tackle that question. There is a standard parametric test associated with Pearson’s correlation coefficient. Unfortunately, this does not have its own name. We will call it “Pearson’s correlation test” to distinguish the test from the actual correlation coefficient. Just keep in mind that this is not an ‘official’ name.
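To make the idea of a ‘linear association’ measure concrete, here is a minimal sketch that computes \(r\) from its usual textbook definition and compares it with R’s built-in `cor` function. The definition used is

\[ r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \, \sum_i (y_i - \bar{y})^2}} \]

The vectors `x` and `y` are made-up illustrative values, not data from this chapter.

```r
# Illustrative values only (an assumption; not taken from any data set here)
x <- c(2, 4, 5, 7, 9, 12)
y <- c(30, 27, 25, 20, 18, 11)

# Deviations of each observation from its variable's mean
dx <- x - mean(x)
dy <- y - mean(y)

# Sum of cross-products, scaled by the root of the product of the
# two sums of squared deviations
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

# Built-in equivalent; method = "pearson" is the default
r_builtin <- cor(x, y, method = "pearson")

print(c(manual = r_manual, builtin = r_builtin))
```

The two numbers should agree up to floating-point error, and the sign is negative because `y` falls as `x` rises.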
Pearson’s correlation test
The logic underpinning Pearson’s correlation test is the same as we’ve seen in previous tests: define a null hypothesis, calculate an appropriate test statistic, work out the null distribution of that statistic, and then use this to calculate a p-value from the observed coefficient. We won’t work through the details other than to note a few important aspects:
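The steps in this logic can be sketched numerically. The example below assumes the usual form of the test, in which the null hypothesis is zero correlation and the statistic \(t = r\sqrt{(n-2)/(1-r^2)}\) is compared with a t distribution on \(n - 2\) degrees of freedom. It checks the hand calculation against R’s `cor.test`. The data are simulated purely for illustration.

```r
# Simulated illustrative data (an assumption; not the chapter's data)
set.seed(1)
n <- 22
x <- rnorm(n)
y <- -0.7 * x + rnorm(n, sd = 0.7)

# Step 1: observed correlation coefficient
r <- cor(x, y)

# Step 2: test statistic under the null hypothesis of zero correlation
t_stat <- r * sqrt((n - 2) / (1 - r^2))

# Step 3: two-sided p-value from a t distribution with n - 2 df
p_manual <- 2 * pt(-abs(t_stat), df = n - 2)

# Compare with the built-in test
test <- cor.test(x, y, method = "pearson")
print(c(t_manual = t_stat, t_builtin = unname(test$statistic)))
print(c(p_manual = p_manual, p_builtin = test$p.value))
```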
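Like any parametric technique, Pearson’s correlation test makes a number of assumptions. These need to be met in order for the statistical test to be reliable. The assumptions are:
1. Both variables are measured on an interval or ratio scale.
2. Both variables are normally distributed.
3. The relationship between the two variables is linear.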
The first two requirements should not need any further explanation at this point: we’ve seen them before in the context of the one- and two-sample t-tests. The third one stems from Pearson’s correlation coefficient being a measure of linear association. Only the linearity assumption needs to be met for Pearson’s correlation coefficient (\(r\)) to be a valid measure of association. As long as the relationship between two variables is linear, \(r\) produces a sensible measure of association. However, the first two assumptions also need to be met for the associated statistical test to be appropriate.
That’s enough background and abstract concepts. Let’s see how to perform correlation analysis in R using Pearson’s correlation coefficient.
Pearson’s product-moment correlation coefficient in R
Work through the example in this section. We’ll be using a new data set ‘BRACKEN.CSV’ here so you should start a new script. Everything below assumes the data in ‘BRACKEN.CSV’ has been read into an R data frame with the name
The plant morph example is not suitable for correlation analysis. We need a new example to motivate a workflow for correlation tests in R. The example we’re going to use is about the association between ferns and heather.
Bracken fern (Pteridium aquilinum) is a common plant in many upland areas. A land manager needs to know whether there is any association between bracken and heather (Calluna vulgaris) in these areas. To determine whether the two species are associated, she sampled 22 plots at random and estimated the density of bracken and heather in each plot. The data are the mean Calluna standing crop (g m⁻²) and the number of bracken fronds per m².
Visualising the data and checking the assumptions
The data are in the file BRACKEN.CSV. Read these data into a data frame, calling it
There are 22 observations (rows) and two variables (columns) in this data set. The two variables hold the Calluna standing crop and the bracken frond count described above.
We should always explore the data thoroughly before carrying out any kind of statistical analysis. To begin, we can visualise the form of the association with a scatter plot:
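A minimal sketch of such a plot, using base R graphics. BRACKEN.CSV is not reproduced here, so the data frame below is simulated stand-in data with 22 rows; the names `bracken`, `Bracken` and `Calluna` are hypothetical and the real file may use different ones.

```r
# Simulated stand-in for the real data (an assumption): 22 plots with a
# negative, roughly linear relationship between the two densities
set.seed(42)
bracken <- data.frame(Bracken = rnorm(22, mean = 16, sd = 4))
bracken$Calluna <- 1000 - 20 * bracken$Bracken + rnorm(22, sd = 60)

# Scatter plot of heather standing crop against bracken frond density
plot(Calluna ~ Bracken, data = bracken,
     xlab = "Bracken fronds per square metre",
     ylab = "Calluna standing crop (g per square metre)")
```

With the real data, the first two statements would be replaced by whatever call reads BRACKEN.CSV into the data frame.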
There appears to be a strong negative association between the species’ abundances, and the relationship seems to follow a ‘straight line’ pattern. It looks like Pearson’s correlation is a reasonable measure of association for these data. We will assess the association with a significance test.
Is it appropriate to carry out the test? We’re dealing with numeric variables measured on a ratio scale (assumption 1). What about their distributions (assumption 2)? Here’s a quick visual summary:
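One way to sketch such a summary in base R is shown below. It swaps in histograms for the plots used in the original and runs on simulated stand-in data, since BRACKEN.CSV is not reproduced here. The names `bracken`, `Bracken` and `Calluna` are hypothetical.

```r
# Simulated stand-in for the real data (an assumption)
set.seed(42)
bracken <- data.frame(Bracken = rnorm(22, mean = 16, sd = 4))
bracken$Calluna <- 1000 - 20 * bracken$Bracken + rnorm(22, sd = 60)

# Draw the two distributions side by side, then restore the plot layout
old_par <- par(mfrow = c(1, 2))
hist(bracken$Bracken, main = "Bracken", xlab = "Fronds per square metre")
hist(bracken$Calluna, main = "Calluna", xlab = "Standing crop (g per square metre)")
par(old_par)
```

With only 22 observations the histograms will look ragged even for normally distributed data, so the aim is to spot gross departures such as strong skew, not to demand a perfect bell shape.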
These dot plots suggest the normality assumption is met, i.e. both distributions are roughly ‘bell-shaped’. That’s all three assumptions met: the variables are on a ratio scale, the normality assumption is met, and the abundance relationship is linear. It looks like the statistical test will give reliable results.
Doing the test
Let’s proceed with the analysis. Carrying out a correlation analysis in R is straightforward. We use the
We have suppressed the output for now to focus on how the function works:
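A minimal sketch of such a call, assuming the function in question is R’s `cor.test` used through its formula interface. The data frame is simulated stand-in data and the names `bracken`, `Bracken` and `Calluna` are hypothetical, so the printed numbers will not match those from BRACKEN.CSV.

```r
# Simulated stand-in for the real data (an assumption)
set.seed(42)
bracken <- data.frame(Bracken = rnorm(22, mean = 16, sd = 4))
bracken$Calluna <- 1000 - 20 * bracken$Bracken + rnorm(22, sd = 60)

# One-sided formula: both variables go to the right of the tilde because
# neither is treated as a response
result <- cor.test(~ Calluna + Bracken, data = bracken, method = "pearson")

print(result)
```

Neither variable appears on the left of the `~` because correlation treats the two symmetrically; swapping their order gives the same result.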
Notice The output from the
We won’t step through most of this as its meaning should be clear. The
What is the actual correlation between bracken and heather densities? That’s given at the bottom of the test output: \(-0.76\). As expected from the scatter plot, there is quite a strong negative association between bracken and heather densities.
Reporting the result
When using Pearson’s method we report the value of the correlation coefficient, the sample size, and the p-value. Here’s how to report the results of this analysis:
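The three quantities to report can be pulled out of the test object directly, as sketched below. The example assumes the test was run with `cor.test` and uses simulated stand-in data with hypothetical names (`bracken`, `Bracken`, `Calluna`), so the printed values will differ from the \(-0.76\) quoted above.

```r
# Simulated stand-in for the real data (an assumption)
set.seed(42)
bracken <- data.frame(Bracken = rnorm(22, mean = 16, sd = 4))
bracken$Calluna <- 1000 - 20 * bracken$Bracken + rnorm(22, sd = 60)

result <- cor.test(~ Calluna + Bracken, data = bracken, method = "pearson")

# Pieces needed for the report
r_value <- unname(result$estimate)  # correlation coefficient
n_obs   <- nrow(bracken)            # sample size
p_value <- result$p.value           # two-sided p-value

# Assemble a compact summary string
report <- sprintf("r = %.2f, n = %d, p = %s",
                  r_value, n_obs, format.pval(p_value, digits = 2))
cat(report, "\n")
```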
Next steps
Notice that when we summarised the result we did not say that bracken is having a negative effect on the heather, or vice versa. It might well be true that bracken has a negative effect on heather. However, our correlation analysis only characterises the association between bracken and heather. If we want to make statements about how one species is related to the other we need to use a different kind of analysis. That’s the focus of our next topic: regression.
What does the Pearson product
Definition. The Pearson product-moment correlation coefficient is a measure of the linear relationship between two questions/measures/variables, X and Y. The correlation value can range from +1 to -1. A positive correlation (e.g., +0.32) means there is a positive relationship between X and Y.
What kind of variables do you need to use with Pearson's correlation coefficient?
When should I use the Pearson correlation coefficient? You should use the Pearson correlation coefficient when (1) the relationship is linear, (2) both variables are quantitative, (3) both are normally distributed, and (4) there are no outliers.