How to Read a Residual Vs Fitted Plot
Introduction
This set of supplementary notes provides further discussion of the diagnostic plots that are produced in R when you run the plot() function on a linear model (lm) object.
1. Residual vs. Fitted plot
The ideal case
Let's begin by looking at the Residuals vs. Fitted plot coming from a linear model that is fit to data that perfectly satisfies all of the standard assumptions of linear regression. What are those assumptions? In the ideal case, we expect the \(i\)th data point to be generated as:
\[y_i = \beta_0 + \beta_1x_{1i} + \cdots + \beta_p x_{pi} + \epsilon_i\]
where \(\epsilon_i\) is an "error" term with mean 0 and some variance \(\sigma^2\).
To create an example of this type of model, let's generate data according to
\[y_i = 3 + 0.1 x_i + \epsilon_i,\]
for \(i = 1, 2, \ldots, 1000\), where the \(\epsilon_i\) are independent Normal random variables with mean 0 and standard deviation 3.
Here's code to generate this data and then regress y on x.
library(ggplot2)  # provides qplot() and stat_smooth()
n <- 1000  # sample size
x <- runif(n, min = 0, max = 100)
y.good <- 3 + 0.1 * x + rnorm(n, sd = 3)
# Scatterplot of the data with regression line overlaid
qplot(x, y.good, ylab = "y", main = "Ideal regression setup") +
  stat_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
# Run regression and display residuals vs. fitted plot
lm.good <- lm(y.good ~ x)
plot(lm.good, which = 1)
The scatterplot shows the perfect setup for a linear regression: the data appear to be well modeled by a linear relationship between \(y\) and \(x\), and the points appear to be randomly spread out about the line, with no discernible non-linear trends or indications of non-constant variance.
When we look at the diagnostic plots, we'll see perfect behavior. The quantities that enter into the diagnostic plots are:
- Fitted values: \(\hat y_i = \hat\beta_0 + \hat\beta_1 x_{1i} + \cdots + \hat\beta_p x_{pi}\)
- Here, \(\hat\beta_j\) is the estimated value of the coefficient for variable \(j\)
- Residuals: \(r_i = y_i - \hat y_i\)
You can think of the residuals \(r_i\) as estimates of the error terms \(\epsilon_i\). So anytime we're looking at a plot that involves residuals, we're doing so because we're trying to assess whether some assumption about the errors \(\epsilon_i\) appears to hold in our data.
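To make the definitions concrete, here's a quick sketch (not part of the original notes) confirming that R's fitted() and residuals() return exactly these quantities for the model above:
y.hat <- cbind(1, x) %*% coef(lm.good)  # fitted values by hand: design matrix times estimated coefficients
all.equal(as.numeric(y.hat), as.numeric(fitted(lm.good)))              # TRUE
all.equal(as.numeric(y.good - y.hat), as.numeric(residuals(lm.good)))  # TRUE: residuals are observed minus fitted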
Looking at the Residuals vs Fitted plot (showing \(r_i\) on the y-axis and \(\hat y_i\) on the x-axis), we see that the red line (which is just a scatterplot smoother, showing the average value of the residuals at each fitted value) is perfectly flat. This tells us that there is no discernible non-linear trend to the residuals. Furthermore, the residuals appear to be equally variable across the entire range of fitted values. There is no indication of non-constant variance.
# Display scale-location plot
plot(lm.good, which = 3)
The scale-location plot is a more sensitive approach to looking for deviations from the constant variance assumption. If you see significant trends in the red line on this plot, it tells you that the residuals (and hence the errors) have non-constant variance. That is, the assumption that all the \(\epsilon_i\) have the same variance \(\sigma^2\) is not true. When you see a flat line like the one shown above, it means your errors have constant variance, like we want to see.
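Under the hood, the scale-location plot graphs the square root of the absolute standardized residuals against the fitted values. Here's a do-it-yourself sketch (not from the original notes) that reproduces its ingredients:
# Hand-rolled scale-location plot: sqrt(|standardized residuals|) vs fitted values
qplot(x = fitted(lm.good), y = sqrt(abs(rstandard(lm.good))),
      xlab = "Fitted values", ylab = "sqrt(|standardized residuals|)") +
  stat_smooth(method = "loess", color = I("red"), se = FALSE)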
Curvature or non-linear trends
Here's an example where we have non-linear trends in the data. This example is constructed to mimic seasonal data.
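Here's one way to generate data like this (a sketch; the exact coefficients and the sine term are illustrative assumptions, not necessarily the original values):
# Illustrative seasonal data: linear trend plus a sinusoidal component
y.curved <- 3 + 0.1 * x + 3 * sin(x / 5) + rnorm(n, sd = 2)
# Scatterplot with the linear fit (blue) and a non-linear loess fit (red)
qplot(x, y.curved, ylab = "y", main = "Data with a seasonal trend") +
  stat_smooth(method = "lm") +
  stat_smooth(method = "loess", color = I("red"), se = FALSE)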
The blue line shows the model fit. The red curve is a non-linear fit that does a better job of modeling the average value of \(y\) at each value of \(x\). Note that the linear model fails to capture the clear non-linear trend that's present in the data. This causes tremendous problems for our inference. Look at the grey confidence band that surrounds the regression line. If the standard linear regression assumptions are satisfied, this band would with high likelihood contain the average value of \(y\) at each value of \(x\). That is, the grey bands around the blue line should mostly contain the red curve. This clearly does not happen. The red curve is almost always far outside the grey bands around the blue regression line.
Take-away: When one or more of the model assumptions underlying the linear model is violated, we can no longer trust our inferential procedures. E.g., our confidence intervals and p-values may no longer be reliable.
Here's what the Residuals vs. Fitted plot looks like for this model.
lm.curved <- lm(y.curved ~ x)
plot(lm.curved, which = 1)
Visually, we see a clear trend in the residuals: they have a periodic pattern. Unfortunately, the scatterplot smoother that's used to construct the red line isn't doing a good job here. This is a case where the neighborhood size (how many points go into calculating the local average) is taken to be too big to capture the trend that we visually observe. Don't always trust that red curve.
Constructing your own Residuals vs Fitted plot
Here's a better version of the default plot.
# Plot model residuals on y axis, fitted values on x axis
# Add red trend curve with better choice of smoothing bandwidth
qplot(y = lm.curved$residuals, x = lm.curved$fitted.values,
      ylab = "Residuals", xlab = "Fitted values",
      main = "The Do-it-yourself Residuals vs. Fitted plot") +
  stat_smooth(method = "loess", span = 0.1, color = I("red"), se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
Non-constant variance
In this example we'll generate data where the error variance increases with \(x\). Our model will be:
\[ y_i = 3 + 0.2 x_i + \epsilon_i, \] where \(\epsilon_i \sim N(0,\ 9(1 + x_i/25)^2)\).
y.increasing <- 3 + 0.2 * x + (1 + x / 25) * rnorm(n, sd = 3)
# Produce scatterplot of y vs x
qplot(x, y.increasing, ylab = "y")
Here's what the Residuals vs. Fitted plot looks like in this case.
lm.increasing <- lm(y.increasing ~ x)
plot(lm.increasing, which = 1)
If you look at this plot, you'll see that there's a clear "funneling" phenomenon. The distribution of the residuals is quite well concentrated around 0 for small fitted values, but they get more and more spread out as the fitted values increase. This is an example of "increasing variance". Here's what the scale-location plot looks like in this example:
plot(lm.increasing, which = 3)
Note the clear upward slope in the red trend line. This tells us we have non-constant variance.
The standard linear regression assumption is that the variance is constant across the entire range. When this assumption isn't valid, such as in this example, we shouldn't believe our confidence intervals, prediction bands, or the p-values in our regression.
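As a quick numeric companion to the plot (an illustrative check, not from the original notes), we can compare the residual spread in the lower and upper halves of the fitted values; under constant variance the two standard deviations should be similar:
r <- residuals(lm.increasing)
f <- fitted(lm.increasing)
sd(r[f < median(f)])   # spread among the smaller fitted values
sd(r[f >= median(f)])  # noticeably larger here, consistent with the funnel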
Normal QQ plot
The Normal QQ plot helps us to assess whether the residuals are roughly normally distributed. If the residuals look far from normal, we may be in trouble. In particular, if the residuals tend to be larger in magnitude than what we would expect from the normal distribution, then our p-values and confidence intervals may be too optimistic. I.e., we may fail to adequately account for the full variability of the data.
The ideal case
First, here's an example of a Normal QQ plot that's about as perfect as it gets. This comes from the ideal simulation setting in the previous section. The residuals here are a close match to the diagonal line. These residuals look to be normally distributed.
plot(lm.good, which = 2)
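For reference, plot(lm.good, which = 2) graphs the standardized residuals against theoretical normal quantiles. A base-R do-it-yourself version (a sketch, not from the original notes):
# QQ plot of standardized residuals, with a reference line added
qqnorm(rstandard(lm.good), ylab = "Standardized residuals")
qqline(rstandard(lm.good))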
Lighter tails
In the next example, we see a QQ plot where the residuals deviate from the diagonal line in both the upper and lower tails. This plot indicates that the tails are 'lighter' (have smaller values) than what we would expect under the standard modeling assumptions. This is indicated by the points forming a "flatter" line than the diagonal.
plot(lm.curved, which = 2)
Heavier tails
In this final example, we see a QQ plot where the residuals deviate from the diagonal line in both the upper and lower tails. Unlike the previous plot, in this case we see that the tails are 'heavier' (have larger values) than what we would expect under the standard modeling assumptions. This is indicated by the points forming a "steeper" line than the diagonal.
plot(lm.increasing, which = 2)
Outliers and the Residuals vs Leverage plot
There's no single accepted definition for what constitutes an outlier. One possible definition is that an outlier is any point that isn't approximated well by the model (has a large residual) and which significantly influences the model fit (has large leverage). This is where the Residuals vs Leverage plot comes in.
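R computes both ingredients directly; here's a quick sketch (illustrative, not from the original notes) pulling them out of the ideal model fit from earlier:
h <- hatvalues(lm.good)       # leverage of each observation
d <- cooks.distance(lm.good)  # influence of each observation on the fit
summary(h); summary(d)        # all small in the ideal setting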
The ideal case
Let's look at our ideal setting once again. The plot below is a great example of a Residuals vs Leverage plot in which we see no evidence of outliers. Those "Cook's distance" dashed curves don't even appear on the plot. None of the points come close to having both high residual and high leverage.
plot(lm.good, which = 5)
An example with possible outliers
set.seed(12345)
y.corrupted <- y.good[1:100]
x.corrupted <- x[1:100]
# Randomly select 10 points to corrupt
to.corrupt <- sample(1:length(x.corrupted), 10)
y.corrupted[to.corrupt] <- -1.5 * y.corrupted[to.corrupt] + 3 * rt(10, df = 3)
x.corrupted[to.corrupt] <- x.corrupted[to.corrupt] * 2.5
# Fit regression and display diagnostic plot
lm.corrupted <- lm(y.corrupted ~ x.corrupted)
plot(lm.corrupted, which = 5)
In this plot we see that there are several points that have high residual and high leverage. The points that lie close to or outside of the dashed red curves are worth investigating further.
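To follow up on flagged points programmatically, one common rule-of-thumb check (a sketch, not from the original notes) is to compare Cook's distance against the 0.5 cutoff that matches R's dashed curves:
# Cook's distance for each point; the dashed curves in the plot correspond
# to Cook's distance 0.5 and 1 (R's default cook.levels)
d <- cooks.distance(lm.corrupted)
which(d > 0.5)  # indices of points worth a closer look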
Can't we just use scatterplots?
All of the examples above were generated by considering the regression of a single outcome variable on a single covariate. In this case, we could've diagnosed most of the violations of model assumptions just by looking at the x-y scatterplot. The reason for using diagnostic plots is that most regressions we run aren't so simple. Most regressions use many variables (tens or hundreds of variables), and in those cases there isn't a good way of visualizing all of the data. Residuals, fitted values and leverage are all quantities that can be computed and plotted regardless of how many variables are included in the model. Thus diagnostics such as the Residuals vs. Fitted plot, Normal QQ plot and Residuals vs. Leverage plot can help us even when we have complicated models.
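For instance, the same diagnostics apply unchanged when we regress on several covariates. A small sketch (the second covariate x2 and the model below are illustrative, not from the original notes):
# Diagnostics look the same no matter how many predictors the model has
x2 <- runif(n, min = 0, max = 10)  # a second, made-up covariate
lm.multi <- lm(y.good ~ x + x2)
par(mfrow = c(2, 2))               # show all four diagnostic plots at once
plot(lm.multi)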
Source: http://www.contrib.andrew.cmu.edu/~achoulde/94842/homework/regression_diagnostics.html