The Coefficient of Determination

If the regression line passed exactly through every point on the scatter plot, it would explain all of the variation. The further the line is from the points, the less it is able to explain. The positive sign of r tells us that the relationship is positive: as the number of stories increases, height increases, as we expected. Because r is close to 1, it tells us that the linear relationship is very strong, but not perfect. The \(r^2\) value tells us that 90.4% of the variation in the height of the building is explained by the number of stories in the building. The correlation of two random variables \(A\) and \(B\) is the strength of the linear relationship between them.
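As a quick sanity check (the value of r itself is not quoted above, so this is back-calculated from the stated percentage): for a simple linear regression, \(r^2\) is literally the square of \(r\), and the sign is given by the stated direction of the relationship, so

\[ r = \sqrt{0.904} \approx 0.951. \]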


1 The Coefficient of Correlation

The correlation coefficient, \(r\), quantifies the strength of the linear relationship between two variables, \(x\) and \(y\), much as the least squares slope, \(b_1\), does. Unlike \(b_1\), however, \(r\) is unitless: its value always falls between \(\pm 1\), regardless of the units used for \(x\) and \(y\). Let's say you are performing a regression task (regression in general, not just linear regression). You have some response variable \(y\), some predictor variables \(X\), and you're designing a function \(f\) such that \(f(X)\) approximates \(y\). There are definite benefits to judging \(f\) by the correlation between \(f(X)\) and \(y\): correlation lives on the easy-to-reason-about scale of -1 to 1, and it generally moves closer to 1 as \(f(X)\) looks more like \(y\).
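A one-line illustration of that unit invariance in R (the vectors here are simulated purely for demonstration):

```r
set.seed(42)
x <- rnorm(50)          # predictor in arbitrary units
y <- 2 * x + rnorm(50)  # response linearly related to x
cor(x, y)               # some value in [-1, 1]
cor(100 * x + 7, y)     # identical: rescaling and shifting x leaves r unchanged
```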

  • In conclusion, the coefficient of determination and the coefficient of correlation are two pillars of statistical analysis, each offering a distinct view of the relationships within data.
  • It is interpreted as the proportion of the variance in the dependent variable that is predictable from the independent variable.
  • Seen in this light, the coefficient of determination is the proportion of the variability in all the y measurements that is accounted for by the linear relationship between x and y.
  • Because r is quite close to 0, it suggests, not surprisingly, that there is next to no linear relationship between height and grade point average.

Negative or otherwise out-of-range values of R² can arise when the predictions being compared to the corresponding outcomes were not derived from a model-fitting procedure using those same data. To make the ideas concrete, imagine we're studying the relationship between hours spent studying and exam scores. By calculating the correlation coefficient, we can discern whether there's a linear relationship between the two variables.
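A minimal sketch of that calculation in R (the hours and scores vectors below are hypothetical, invented only to show the calls):

```r
hours  <- c(1, 2, 3, 4, 5, 6, 7, 8)          # hypothetical study hours
scores <- c(52, 55, 61, 64, 70, 72, 78, 83)  # hypothetical exam scores
cor(hours, scores)    # r: direction and strength of the linear relationship
cor(hours, scores)^2  # r^2: proportion of score variance explained
```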

Example 5.3 (Example 5.2 revisited). We can find the coefficient of determination using the summary function on an lm object. We see that 93.53% of the variability in the volume of the trees can be explained by the linear model using girth to predict volume. The correlation \(r\) is for the observed data, which is usually from a sample, and the calculation of \(r\) uses the same data that is used to fit the least squares line.
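In R that looks roughly like the following (assuming, as the 93.53% figure suggests, the built-in trees dataset with Girth predicting Volume):

```r
fit <- lm(Volume ~ Girth, data = trees)  # least squares fit
summary(fit)$r.squared                   # coefficient of determination, ~0.9353
cor(trees$Girth, trees$Volume)^2         # identical for a single predictor
```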

Coefficient of Correlation:

  • Because increases in the number of regressors increase the value of R2, R2 alone cannot be used as a meaningful comparison of models with very different numbers of independent variables.
  • In contrast, the coefficient of determination (R²) represents the proportion of variance in the dependent variable explained by the independent variable, generally ranging from 0 (no explained variance) to 1 (complete explained variance).
  • The coefficient of determination is a measure of how well the regression line represents the data.
  • A high R² indicates low bias error, because the model can better explain the change in Y with the predictors.
  • Meanwhile, a model that makes fewer assumptions tends to be more complex.

Indeed, to find that line we need to compute the first derivative of the cost function, and it is much harder to differentiate absolute values than squared values. Also, squaring the differences magnifies the error distance, making bad predictions more pronounced than good ones. The only real difference between the least squares slope \(b_1\) and the coefficient of correlation \(r\) is the measurement scale. Ingram Olkin and John W. Pratt derived the minimum-variance unbiased estimator for the population R², known as the Olkin–Pratt estimator. Comparisons of different approaches for adjusting R² concluded that in most situations either an approximate or the exact version of the Olkin–Pratt estimator should be preferred over (Ezekiel) adjusted R². The Ezekiel adjusted R² is

\[ \bar{R}^2 = 1 - (1 - R^2)\frac{n - 1}{n - p - 1}, \]

where p is the total number of explanatory variables in the model (excluding the intercept), and n is the sample size.
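A short R sketch of that adjustment, reusing the trees fit from above:

```r
fit <- lm(Volume ~ Girth, data = trees)
r2 <- summary(fit)$r.squared
n  <- nrow(trees)                     # sample size
p  <- 1                               # explanatory variables, excluding intercept
1 - (1 - r2) * (n - 1) / (n - p - 1)  # Ezekiel adjusted R^2, computed by hand
summary(fit)$adj.r.squared           # matches R's built-in value
```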


Coefficient of Determination vs. Coefficient of Correlation in Data Analysis

The correlation between wine consumption and heart disease deaths (0.71) is an ecological correlation. The correlation between skin cancer mortality and state latitude of 0.68 is also an ecological correlation. In both cases, we should not use these correlations to try to draw a conclusion about how an individual's wine consumption or suntanning behavior will affect their individual risk of dying from heart disease or skin cancer. We shouldn't try to draw such conclusions anyway, because "association is not causation."

In contrast, the coefficient of determination (R²) represents the proportion of variance in the dependent variable explained by the independent variable, generally ranging from 0 (no explained variance) to 1 (complete explained variance). R² is often described as the square of the correlation coefficient (r); this is exact for simple linear regression but a simplification in general. In least squares regression using typical data, R² is at least weakly increasing with the number of regressors in the model. Because increases in the number of regressors increase the value of R², R² alone cannot be used as a meaningful comparison of models with very different numbers of independent variables. For a meaningful comparison between two models, an F-test can be performed on the residual sums of squares, similar to the F-tests in Granger causality, though this is not always appropriate. The formula for computing the coefficient of determination for a linear regression model with one independent variable is given below.
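In the usual notation, with \(SS_{res}\) the residual sum of squares and \(SS_{tot}\) the total sum of squares:

\[ R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = r^2 \quad \text{(for a single predictor)}. \]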

Coefficient of Determination vs. Coefficient of Correlation:

One of the ways to answer this question is to examine the correlation coefficient and the coefficient of determination. Before we dive into the specifics, let's establish a foundational understanding. Both coefficients are about relationships in data, but they answer different questions: the coefficient of correlation tells us about the direction and strength of a relationship between two variables, while the coefficient of determination reveals how well one variable can predict another.

In a multiple linear model

For example, the practice of carrying matches (or a lighter) is correlated with the incidence of lung cancer, but carrying matches does not cause cancer (in the standard sense of "cause"). The adjusted R² statistic is interpreted almost the same way as R², but it penalizes the statistic as extra variables are included in the model. For cases other than fitting by ordinary least squares, the R² statistic can be calculated as above and may still be a useful measure.

2.1 Proportion of Variation Explained

There are also some glaring negatives: the scale of \(f(X)\) can be wildly different from that of \(y\), and the correlation can still be large. The adjusted R² can be negative, and its value will always be less than or equal to that of R². Unlike R², the adjusted R² increases only when the increase in R² (due to the inclusion of a new explanatory variable) is more than one would expect to see by chance.
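A small R experiment makes the first point concrete (the data here are simulated solely for illustration): predictions can correlate perfectly with \(y\) while being terrible on \(y\)'s own scale, which is exactly the failure R² catches.

```r
set.seed(1)
y    <- rnorm(100)             # response
pred <- 100 * y + 50           # "predictions": perfectly correlated, wrong scale
cor(pred, y)                   # 1: correlation is blind to scale and offset
ss_res <- sum((y - pred)^2)    # residual sum of squares against y itself
ss_tot <- sum((y - mean(y))^2)
1 - ss_res / ss_tot            # R^2 is hugely negative: the predictions are awful
```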

Comparison with residual statistics

With more than one regressor, R² can be referred to as the coefficient of multiple determination. In this form, R² is expressed as the ratio of the explained variance (the variance of the model's predictions, \(SS_{reg}/n\)) to the total variance (the sample variance of the dependent variable, \(SS_{tot}/n\)). For the correlation coefficient, a value of +1 indicates a perfect positive relationship, -1 a perfect negative relationship, and 0 no relationship; in simpler terms, it shows whether and how strongly two variables move together. Note, however, that adding parameters mechanically increases R², so a higher R² from a larger model does not by itself indicate better performance.
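Written out, that ratio form is

\[ R^2 = \frac{SS_{reg}/n}{SS_{tot}/n} = \frac{SS_{reg}}{SS_{tot}}, \]

which, for ordinary least squares with an intercept, equals \(1 - SS_{res}/SS_{tot}\).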

It measures the proportion of the variability in y that is accounted for by the linear relationship between x and y. If we want to find the correlation coefficient, we can use the cor function on the dataframe; this will find the correlation coefficient for each pair of variables. Note that the dataframe can contain only quantitative variables in order for this function to work. Why do we take the squared differences and not simply the absolute differences?
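For example, continuing with the trees data (again an assumption about the dataset in use), since all three of its columns are numeric:

```r
cor(trees)                    # 3x3 matrix of pairwise correlations: Girth, Height, Volume
cor(trees)["Girth", "Volume"] # the single pair used in the regression above
```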

Based on the bias-variance tradeoff, higher complexity will lead to a decrease in bias and better performance (below the optimal line). In R², the term \(1 - R^2\) will be lower with high complexity, resulting in a higher R² that suggests better performance. The negative sign of r tells us that the relationship is negative: as driving age increases, seeing distance decreases, as we expected. Because r is fairly close to -1, it tells us that the linear relationship is fairly strong, but not perfect. The \(r^2\) value tells us that 64.2% of the variation in seeing distance is explained by taking into account the age of the driver.
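The same back-calculation as in the stories-and-height example applies here, with the sign taken from the stated direction of the relationship:

\[ r = -\sqrt{0.642} \approx -0.801. \]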