This week will give you further practice with making plots, estimating linear models in R, and getting predicted values from those fitted models. This exercise will be completed by Friday afternoon at 5pm. Create a new project in RStudio named “yourlastname_lab2”. Submit the compressed project directory, which should include an R Markdown file that can be compiled directly to html without errors. The answers to the numbered questions should be in plain text. The code should be in R code chunks.
Load the NRES803 package.
library(NRES803) data(RIKZ) dim(RIKZ)
This file has 45 observations and 89 variables. Many of those variables are abundance measures for single species. These are in columns 2 through 76. To calculate species richness for each site we need to first use a logical variable to seperate the zeros from positive abundances. Then we add up each row. We can use a convenient function rowSum() that calculates the sum of each row in a matrix. The “>0” part converts the count observations of species in columns 2 through 76 into presence/absence of each species. The result is a single vector that contains the species richness, or alpha diversity, of each site in the dataset.
RIKZ$Richness <- rowSums(RIKZ[,2:76] > 0) dim(RIKZ)
You can see that we now have one more variable than before. Now we want to plot species richness against NAP, which is a measure of how far up the beach the sampling location is.
library(ggplot2) # make "Beach" into a factor otherwise treated as continuous RIKZ$fBeach <- factor(RIKZ$Beach) Richness_NAP <- ggplot(RIKZ, aes(x = NAP, y = Richness)) + geom_point(aes(color=fBeach)) Richness_NAP
I’ve saved the output of
ggplot() into a variable so that I can simply reuse that object later when I want to add something to the plot. This isn’t necessary, but saves on copying and pasting code if you’re making the same plot multiple times. Now we fit a linear model to that data.
RIKZ_model.1 <- lm(Richness ~ NAP, data = RIKZ) summary(RIKZ_model.1)
You’ve already done some interpretation of that model in the readiness assessment. Let’s take a look at the assumptions of that model.
library(broom) test_RIKZ <- augment(RIKZ_model.1, data=RIKZ) ggplot(test_RIKZ, aes(x = .fitted, y = .resid)) + geom_point(aes(color=fBeach)) + geom_smooth() # this is lifted out of base R qqline() y <- quantile(test_RIKZ$.resid, c(0.25, 0.75)) x <- qnorm(c(0.25, 0.75)) slope <- diff(y)/diff(x) int <- y[1L] - slope * x[1L] ggplot(test_RIKZ, aes(sample = .resid)) + stat_qq() + geom_abline(slope = slope, intercept = int) ggplot(test_RIKZ, aes(x = .fitted, y = sqrt(abs(.std.resid)))) + geom_point() + geom_smooth() + geom_hline(yintercept = 1) ggplot(test_RIKZ, aes(.hat, .std.resid)) + geom_vline(size = 2, colour = "white", xintercept = 0) + geom_hline(size = 2, colour = "white", yintercept = 0) + geom_point(aes(size = .cooksd)) + geom_smooth(se = FALSE)
- Briefly interpret what those plots are telling you about each of the four assumptions of the linear model.
Next we will look for other important variables. It would not be uncommon for ecological communities to vary over time.
- Plot species richness against the variable week. Save the plot object in an object named
Do you think the timing of the sample collection matters? Test your intuition with a linear model
RIKZ_model.2 = lm(Richness ~ week, data = RIKZ) summary(RIKZ_model.2)
- Does week of sampling affect species richness? How well does the fitted line represent the variation among weeks? (hint graphically check assumptions)
Notice that there are only 4 weeks, with multiple observations per week. Another way to analyse this type of variable is to treat it as a categorical variable, and estimate a different parameter for each week. We can do this quickly like this:
RIKZ_model.3 <- lm(Richness~factor(week), data=RIKZ)
However, this ends up labeling the parameters as factor(week) which is a bit odd, and makes it more difficult to do predictions from the model. A better approach is to change the variable in the dataframe:
RIKZ$fweek = factor(RIKZ$week) RIKZ_model.3 = lm(Richness ~ fweek, data = RIKZ)
- Which model, 2 or 3, is a better predictor of Species Richness? Does model 3 fit the assumptions of the linear model?
Now we would like to find out what the estimated species richness is in each week. We could do the calculations manually, but there is a better way that uses the fitted model object. First we need a data.frame that contains one row for each estimate, and one column for each covariate in the model that we want to predict from. We can call this data.frame anything, but I prefer to call it nd, for “new data”. In our case, we are going to have 4 rows, one for each week, as a factor called fweek:
nd <- data.frame(fweek=factor(levels(RIKZ$fweek))) # There is a lot going on in that one line of code. The best way to break this down is to run each piece from the “inside” out, so: RIKZ$fweek # shows us the original data levels(RIKZ$fweek) # gives a vector of the unique levels factor(levels(RIKZ$fweek)) # creates a new “factor” # create a data.frame with one column called fweek data.frame(fweek=factor(levels(RIKZ$fweek))) # and save it in an object named ‘nd’ nd <- data.frame(fweek=factor(levels(RIKZ$fweek)))
Whenever you come across a bit of R code that is doing something complicated, you can use this approach to decompose what is going on. This is a very powerful way to learn R.
Now we need to use the nd data.frame, together with the fitted model object, to get the predicted values:
This gives us a dataframe of mean species richness in each week and the standard errors. So the mean species richness in week 3 is 4.2±0.96 (SE). That’s nice, but it would be nicer if we could plot it on a figure. Save the predicted means and standard errors to an object named ‘predicted.Richness’ and make the plot again.
predicted.Richness <- augment(RIKZ_model.3, newdata=nd) # have to set color explicitly otherwise looks for fbeach in predicted.Richness richness_week + geom_point(aes(x = fweek, y = .fitted), color = "red", data = predicted.Richness)
That will fail with an error about a discrete value being supplied to a continuous scale. The problem is that we made the object richness_week with week not fweek. So let’s redo that object.
predicted.Richness <- augment(RIKZ_model.3, newdata=nd) # redo the plot object richness_week <- ggplot(RIKZ, aes(x = fweek, y = Richness, color = fBeach)) + geom_point() # have to set color explicitly otherwise looks for fbeach in predicted.Richness richness_week + geom_point(aes(x = fweek, y = .fitted), color = "red", size = 2, data = predicted.Richness) + geom_errorbar(aes(x = fweek, ymin = .fitted - 1.96*.se.fit, ymax = .fitted + 1.96*.se.fit), color = "red", data = predicted.Richness, inherit.aes = FALSE)
This same procedure works with continuous variables too. If we only have a single linear covariate we can add
geom_smooth(method = "lm") to plot the fitted line but that won’t work for more complex models or for categorical variables.
Both NAP and fweek seem to affect the response. Fit a linear regression model with main effects of both NAP and fweek, like this:
RIKZ_model.4 = lm(Richness ~ NAP + fweek, data = RIKZ)
- Examine the summary for this model and compare with the previous models.Also compare the coefficients for NAP and fweek with those estimated in models 1 and 3 – are the qualitative conclusions (sign and magnitude) different when the variables are combined in a single model?
Making a plot with a more complex model is a bit involved. The problem is that we now have to make predictions with both NAP and fweek. So we are going to get 4 lines on a plot of Richness vs. NAP; one for each week.
nd = expand.grid(NAP=seq(-1.5,2.5,0.1), fweek=factor(levels(RIKZ$fweek))) predicted.Richness.4 <- augment(RIKZ_model.4, newdata = nd) # reuse ggplot object from above. fix color to black or geom_line looks for fBeach. x aesthetic inherited from Richness_NAP Richness_NAP + geom_line(aes(y = .fitted, group = fweek, linetype = fweek), color = "black", data = predicted.Richness.4)
It is important that the variable fweek in nd has the same levels as the variable that was used to fit the model; specifying the levels is the best way to ensure this. Notice that the lines are all parallel – that is because we fitted a model with no interaction between the continuous variable NAP and the factor variable fweek. Effectively we have 4 intercepts and one slope.
Now fit a model that includes both NAP, fweek, and their interaction:
RIKZ_model.5 = lm(Richness~NAP*fweek,data=RIKZ)
Look carefully to see the difference in the code – it is a single character! However that single change adds 3 parameters to our model. Now we have 8 parameters, essentially an intercept and slope for each week.
- Examine the summary for this model and compare with the previous models.Also compare the coefficients for NAP and fweek with those estimated in models 1, 3 and 4 – are the qualitative conclusions (sign and magnitude) different with the interaction?
The three interaction coefficients can be thought of as a change in the slope of NAP for weeks 2, 3 and 4. So we expect the line for week 2 to be shallower (less negative), while weeks 3 and 4 should be steeper (more negative). Confirm this interpretation by repeating the figure above using model 5 to get the predictions.
# try with sum-to-zero contrasts options(contrasts = c("contr.sum","contr.poly")) RIKZ_model.6 = lm(Richness~NAP*fweek,data=RIKZ)