Week 6 Lab

Tue, Aug 1, 2017

In this lab we’re going to use a subset of data from Sikkink et al. (2007; Chapter 26 in AED) on grassland species richness in Yellowstone National park. The data were measured on 8 different transects in 8 years between 1958 and 2002. Not every transect was measured in every year, so the intervals between samples varies from 4 to 11 years in duration. According to Zuur et al. (2009) the Richness variable is the Beta diversity - the number of species unique to that site. Here we’re only using a subset of covariates identified by Sikkink et al. (2007) as important.

library(NRES803)
library(tidyverse)
library(GGally)  # for ggpairs()

# clumsy workaround for new broom behavior - put into a
# function because we will reuse below
augment.other <- function(model) {
    r <- model.frame(model)
    r$.fitted <- fitted(model)
    r$.resid <- resid(model)
    r$.std.resid <- residuals(model, type = "scaled.pearson")
    r$.hat <- model$hat
    r$.cooksd <- cooks.distance(model)
    return(r)
}

data(vegetation)
with(vegetation, table(SAMPLEYR, Time))  # different scale, same thing
with(vegetation, table(SAMPLEYR, Transect))  # every transect sampled every year here


ggpairs(vegetation[, c(-1, -2, -3)])  # might bog down older computer

# many warnings about missing data
sum(complete.cases(vegetation))
nrow(vegetation)

As you can see from the ggpairs() plot, one of the issues we’ll have to deal with is collinearity between the explanatory variables. There is a moderately high negative correlation between ROCK and LITTER, which makes sense as these are both cover variables. There are also a few outliers for FallPrec and a few rows with missing data for Richness. Let’s get rid of the incomplete cases.

vegetation <- vegetation[complete.cases(vegetation), ]

I’ll start by exploring the response as a smooth function of just one of the variables, the maximum spring temperature, SprTmax. With this lab I’m going to follow a different approach to model selection advocated by Andrew Gelman, a statistician from Columbia University. Gelman argues that you should start with the simplest plausible model, and only add variables when you have evidence of a lack of fit. This is similar to forward selection, but is less programmed and involves more thinking. Spring temperature is certainly plausible as it could affect germination of many species.

library(mgcv)
M1 <- gam(Richness ~ s(SprTmax), data = vegetation)
summary(M1)

Looking at the summary, we can see that the average number of species unique to a transect is about 10 (look under parametric coefficients at the estimate of the intercept). The smooth term has approximately 4 degrees of freedom, and is also highly significant. On page 108 of AED Zuur et al. point out that this F-test of the significance of the smoother is only approximate, but that simulation testing has shown it to be robust. However, when we use gam() there is an additional level of uncertainty about the model that arises because we’re estimating the degree of complexity to use in the smoother. Simon Wood (designer of mgcv) suggests that you can trust the conclusion from p-values < 0.001 and > 0.1, but that in between those points you should be cautious. In this case, we’re safe because the p-value for the smooth is \(< 2\times10^{-16}\). At the bottom of the summary R reports the \(R^2\) and the GCV score. The GCV score is the metric that gam() uses to decide how complicated the smooth function needs to be. The absolute value of the GCV varies from data set to data set, so its not clear how useful this is except for comparing different gam() fits to the same dataset. Reviewers and supervisors often have a suspicious response to gam() fits, because there are no coefficients to report. The only thing you can do is plot(M1) to see what the shape of the response is.

plot(M1)

This shows us the response first rises with increasing maximum spring temperature and then decreases. The dotted lines indicate approximate 95% confidence limits, and the rug on the x axis indicates the spread of the data. Notice that as the data thin out to the right, the confidence limits expand. The y-axis shows the change relative to the intercept of the model, which is why the scale looks odd, with negative numbers of species. To see this on a scale that’s more meaningful we need to use augment() with some new data.

library(broom)
nd = data.frame(SprTmax=seq(9,17,0.1))
# use augment() with newdata ARRRGGG
pred.R = augment(M1, newdata=nd, se_fit = TRUE)
p <- ggplot(vegetation, aes(x=SprTmax, y=Richness)) + 
  geom_point() + xlab("Maximum spring temperature")
pgam <- p + geom_line(data=pred.R, mapping=aes(x=SprTmax, y=.fitted)) + 
  geom_ribbon(data=pred.R, aes(x = SprTmax, ymin = .fitted - 1.96*.se.fit, ymax = .fitted + 1.96*.se.fit),inherit.aes = FALSE, alpha=0.5, color = "grey")
pgam

So it appears that warm springs have fewer unique species emerging. Just out of curiosity lets look at the pattern of temperature by time and transect.

# convert Transect to a factor -- it is discrete
vegetation$Transect <- factor(vegetation$Transect)
ggplot(vegetation, aes(x=SAMPLEYR, y = SprTmax,  color=Transect)) + geom_point()

So much of the variation in temperature is variation between transects, with the rank order of transects being fairly consistent within each year. It also appears that there might be 3 groups of transects (1&2, 3 through 6, and 7&8). So the effect of spring temperature could also be a spatial effect to some extent. Although this is stretching the amount of data we have greatly, we could explore to see if there is an interaction between the “group” and temperature to see if it is a temperature effect, a spatial effect, or both. First let us examine the residuals from our GAM.

# r <- augment(M1)


r <- augment.other(M1)

r1 <- ggplot(data = r, aes(x = .fitted, y = .resid)) + geom_point() + 
    geom_smooth() + geom_hline(yintercept = 0, linetype = 2)

# this is lifted out of base R qqline()
y <- quantile(r$.resid, c(0.25, 0.75))
x <- qnorm(c(0.25, 0.75))
slope <- diff(y)/diff(x)
int <- y[1L] - slope * x[1L]

r2 <- ggplot(r, aes(sample = .resid)) + stat_qq() + geom_abline(slope = slope, 
    intercept = int)

r3 <- ggplot(r, aes(x = .fitted, y = sqrt(abs(.std.resid)))) + 
    geom_point() + geom_smooth() + geom_hline(yintercept = 0.822)  # the actual value

r4 <- ggplot(r, aes(.hat, .std.resid)) + geom_vline(size = 2, 
    colour = "white", xintercept = 0) + geom_hline(size = 2, 
    colour = "white", yintercept = 0) + geom_point(aes(size = .cooksd)) + 
    geom_smooth()
library(gridExtra)  # for grid.arrange()
grid.arrange(r1, r2, r3, r4, ncol = 2)

So it looks like there is a bit of heteroscedasticity, but no sign that there is any pattern to the residuals because the 0 line is entirely inside the confidence band on the smooth.

An aside

At this point a nameless skeptic looked over my shoulder while I was writing the lab, and demanded proof that the non-linearity was real. No joke! However, this isn’t unknown as I mentioned above, so let’s look at the linear model and compare them.

lm.1 <- lm(Richness~SprTmax,data=vegetation)
summary(lm.1)

So the gam model has a much better \(R^2\), but uses another 3 degrees of freedom to get that improvement.

This plot compares the predictions.

pred.Rlm <- augment(lm.1, newdata = nd, se_fit = TRUE)
pgam + geom_line(data = pred.Rlm, mapping = aes(x = SprTmax, 
    y = .fitted)) + geom_ribbon(data = pred.Rlm, aes(x = SprTmax, 
    ymin = .fitted - 1.96 * .se.fit, ymax = .fitted + 1.96 * 
        .se.fit), inherit.aes = FALSE, alpha = 0.5, color = "grey")

The conclusions are quite different, especially at low temperatures. Even at higher temperatures there is a steep drop in richness going from 11 to 13 degrees, and then a much lower slope above that.

How can we tell if the linear model is sufficient? One way is to look at the residuals, and see if there is any kind of pattern in them.

r <- augment(lm.1)  # this still works!
r1 <- ggplot(data = r, aes(x = .fitted, y = .resid)) + geom_point() + 
    geom_smooth() + geom_hline(yintercept = 0, linetype = 2)

# this is lifted out of base R qqline()
y <- quantile(r$.resid, c(0.25, 0.75))
x <- qnorm(c(0.25, 0.75))
slope <- diff(y)/diff(x)
int <- y[1L] - slope * x[1L]

r2 <- ggplot(r, aes(sample = .resid)) + stat_qq() + geom_abline(slope = slope, 
    intercept = int)

r3 <- ggplot(r, aes(x = .fitted, y = sqrt(abs(.std.resid)))) + 
    geom_point() + geom_smooth() + geom_hline(yintercept = 0.822)

r4 <- ggplot(r, aes(.hat, .std.resid)) + geom_vline(size = 2, 
    colour = "white", xintercept = 0) + geom_hline(size = 2, 
    colour = "white", yintercept = 0) + geom_point(aes(size = .cooksd)) + 
    geom_smooth()
grid.arrange(r1, r2, r3, r4, ncol = 2)

There is a pattern in the residuals vs. fitted plot at the higher values of predicted richness. The heteroscedasticity we mentioned before is also clearer in the scale-location plot. These plots didn’t convince my skeptic that the GAM was necessary. Ideally, we’d compare the two models using ANOVA or AIC and judge that way. However, comparing models fitted by different functions is fraught with difficulty. What we need to do is use gam() to fit the simpler model.

M0 <- gam(Richness ~ SprTmax, data = vegetation)
anova(M0, M1, test = "F")

So the change in deviance is unlikely to have occurred by chance. The p-value is on the border of the region where we would want to be careful about rejecting the null hypothesis, but still OK. Anyway, it convinced my household skeptic.

Back to the lab

Before I was so rudely interrupted I was going to examine the residuals vs. SAMPLEYR and Transect.

# screwed up my residual df so make new one add the SAMPLEYR
# and Transect columns
r <- vegetation %>% select(SAMPLEYR, Transect) %>% bind_cols(augment.other(M1))
ggplot(r, aes(x = SAMPLEYR, y = .resid)) + geom_point() + geom_smooth() + 
    geom_hline(yintercept = 0, linetype = 2)

The plot by year doesn’t look too bad, perhaps a bit of heteroscedasticity, and a hint of a shift in richness from the 1960’s through to the 1970’s. Less than 1 species on average.

ggplot(r, aes(x = Transect, y = .resid)) + geom_point() + geom_smooth() + 
    geom_hline(yintercept = 0, linetype = 2)

# no smooth appeared because x axis is a factor

Although it doesn’t look too bad, there is again a suggestion that something is varying across transects. It doesn’t appear to match up with the groups of transects in the temperature by year plot, as transects 3 & 4 are much lower than transects 5 & 6.

# use indexing to set up a group category
r$Group <- factor(c(1, 1, 2, 2, 2, 2, 3, 3)[r$Transect])
ggplot(r, aes(x = Group, y = .resid)) + geom_boxplot()

It certainly appears that there is something varying by transect that affects species richness. Looking back at the ggpairs() plot it seems like the variable BARESOIL is negatively correlated with Richness, and not very correlated with SprTmax.

ggplot(vegetation, aes(x=SAMPLEYR, y = BARESOIL, color = Transect)) + 
  geom_point()

There’s definately a temporal trend in baresoil amount, but not much relationship with transect. Let us include it in the model and see what happens.

M2 <- gam(Richness ~ s(SprTmax) + s(BARESOIL), data = vegetation)
summary(M2)

Our intercept hasn’t changed, the \(R^2\) has increased a bit, and the GCV score dropped a bit. However, look at the estimated degrees of freedom for the BARESOIL variable; it is 1. This means we can safely include it as a linear term in the model.

M2 <- gam(Richness ~ s(SprTmax) + BARESOIL, data = vegetation)
summary(M2)
anova(M1, M2, test = "F")

So that is a significant improvement in the Deviance.

It’s also possible that the effect of spring temperature and baresoil interact, and we can build that in using a “by” argument to the smooth term. Placing a numeric variable in the by argument tells gam() to multiply the smooth by the variable - effectively creating an interaction between baresoil and the smooth function of spring temperature.

M3 <- gam(Richness ~ s(SprTmax, by = BARESOIL) + BARESOIL, data = vegetation)
summary(M3)

This has increased the complexity of the model and actually makes the deviance worse, so let’s not go there. Next, examine the residuals. There is a function that does a bit of the residual plotting for us, gam.check() - it also does some diagnostic checks on the smoother.

gam.check(M2)

The quantile-quantile plot and histogram of the residuals look reasonable, but there is still that annoying amount of heteroscedasticity in the residuals. Let’s repeat the plots of residuals against year and transect again too.

r2 <- vegetation %>% select(SAMPLEYR, Transect) %>% bind_cols(augment.other(M2))  # huh, for some reason broom is ok with M2??!?
pr2a <- ggplot(r2, aes(x = SAMPLEYR, y = .resid)) + geom_point() + 
    geom_smooth() + geom_hline(yintercept = 0, linetype = 2)

# no smooth here b/c x axis is a factor
pr2b <- ggplot(r2, aes(x = Transect, y = .resid)) + geom_point() + 
    geom_hline(yintercept = 0, linetype = 2)
grid.arrange(pr2a, pr2b, ncol = 2)

There’s still more pattern there than I like to see. Let’s see if any of our other covariates vary among transects or years.

ggplot(vegetation, aes(x = Transect, y = BARESOIL)) + geom_boxplot()

ggplot(vegetation, aes(x = SAMPLEYR, y = BARESOIL, color = Transect)) + 
    geom_point() + geom_line()

The percentage of Bare rock looks interesting - it varies over time within transects, but also much more between transects, and in a way that matches the residual pattern, maybe.

ggplot(vegetation, aes(x = Transect, y = LITTER)) + geom_boxplot()

ggplot(vegetation, aes(x = SAMPLEYR, y = LITTER, color = Transect)) + 
    geom_point() + geom_line()

Litter varies too, but much more between years than between transects - the lines cross alot more.

ggplot(vegetation, aes(x = Transect, y = FallPrec)) + geom_boxplot()

ggplot(vegetation, aes(x = SAMPLEYR, y = FallPrec, color = Transect)) + 
    geom_point() + geom_line()

Fall Precipitation varies alot among years, without much variation between transects within years. So after Spring temperature and Baresoil, Rock has the highest correlation with Richness, and seems like it might reflect variation among transects. Litter is less well correlated with Richness, and is also highly correlated with Rock, so we’ll leave it out. Fall precipitation also seems like it could easily affect germination rates.

M4 <- gam(Richness ~ s(SprTmax) + BARESOIL + s(ROCK), data = vegetation)
summary(M4)
plot(M4)

anova(M2, M4, test = "F")

This addition gave us a huge jump in the \(R^2\), and a very significant change in the deviance. We might consider a model that allows for an interaction between ROCK and SprTmax, which we do by introducing a 2 dimensional smooth term. This term will capture responses not modelled by the independent smooth functions of temperature and Rock. I changed the basis for this new term to “ts”, which is a version of the thin plate regression spline that will allow the degrees of freedom to approach zero. If it does approach zero, then we know we can safely remove that term as it isn’t contributing to the fit.

M5 <- gam(Richness~s(SprTmax)+s(ROCK)+s(SprTmax,ROCK,bs="ts")+BARESOIL,
          data=vegetation)
summary(M5)

There is no change in the \(R^2\) and GCV scores, and the estimated degrees of freedom for the 2-D term is essentially zero, so we can leave it out.

Just to be sure, run a model changing the basis of the two single variable smooth terms in M5 to “ts” as well, admitting the possibility that those smooth terms should be removed. Compare that model with M4. Do we need the 2D interaction term?

Now lets reexamine the residuals to see how we’re doing.

gam.check(M4)

r4 <- vegetation %>% select(SAMPLEYR, Transect) %>% bind_cols(augment.other(M4))
pr4a <- ggplot(r4, aes(x = SAMPLEYR, y = .resid)) + geom_point() + 
    geom_smooth() + geom_hline(yintercept = 0, linetype = 2)

# no smooth here b/c x axis is a factor
pr4b <- ggplot(r4, aes(x = Transect, y = .resid)) + geom_point() + 
    geom_hline(yintercept = 0, linetype = 2)
grid.arrange(pr4a, pr4b, ncol = 2)

There’s still a bit of pattern left in the plot of residuals vs. transect, but more of the form of heteroscedasticity. Normally I think I would stop here, or maybe try a different family to see if the heteroscedasticity can be fixed. However, I want to try a few more models. Let’s see if precipitation variation between years has anything to tell us.

M6 <- gam(Richness ~ s(SprTmax) + BARESOIL + s(ROCK) + s(FallPrec), 
    data = vegetation)
summary(M6)
anova(M4, M6, test = "F")

Although the model improves slightly, the new smooth term is not significant, and the change in deviance between M4 and M6 is right in the doubtful region. How about the residuals?

gam.check(M6)

r6 <- vegetation %>% select(SAMPLEYR, Transect) %>% bind_cols(augment.other(M6))
pr6a <- ggplot(r6, aes(x = SAMPLEYR, y = .resid)) + geom_point() + 
    geom_smooth() + geom_hline(yintercept = 0, linetype = 2)

# no smooth here b/c x axis is a factor
pr6b <- ggplot(r6, aes(x = Transect, y = .resid)) + geom_point() + 
    geom_hline(yintercept = 0, linetype = 2)
grid.arrange(pr6a, pr6b, ncol = 2)

These look much better. So we’re in a bit of a quandry - Fall precipitation helps, but not conclusively. I want to try 2 more models, one that include transect and a threshold model of year.

The threshold model is suggested by the residual pattern vs. SAMPLEYR for M4 - looks like there is a “step” change around 1980. It could be that something changed in the park around 1980, and this is sort of correlated with a change in precipitation. I’m going to create a variable that says Year > 1980, and use that as a categorical variable in the model.

vegetation$Period <- factor(vegetation$SAMPLEYR > 1980)
M7 <- gam(Richness ~ s(SprTmax) + BARESOIL + s(ROCK) + Period, 
    data = vegetation)
summary(M7)
anova(M4, M7, test = "F")

This is better than M4 although not by much.

gam.check(M7)

r7 <- vegetation %>% select(SAMPLEYR, Transect) %>% bind_cols(augment.other(M7))
pr7a <- ggplot(r7, aes(x = SAMPLEYR, y = .resid)) + geom_point() + 
    geom_smooth() + geom_hline(yintercept = 0, linetype = 2)

# no smooth here b/c x axis is a factor
pr7b <- ggplot(r7, aes(x = Transect, y = .resid)) + geom_point() + 
    geom_hline(yintercept = 0, linetype = 2)
grid.arrange(pr7a, pr7b, ncol = 2)

Not really any improvement compared with M4.

Fit M8 adding Transect to M4, compare with M4 and check residuals.

What about that heteroscedasticity? Now let’s see, we’ve got a variable that only takes integer values and can’t be less than zero … any thoughts?

Refit model 4 with a different family. Check the residuals and other assumptions, decide which model is better. Use AIC to adjudicate between your normal error model and this new model.
The other way to fix heteroscedasticity is to log transform the response. Does that work here? Why or why not? Can you compare this model with M4 directly?
Using the model of your choice, make a plot of the predicted values of Richness as a function of spring maximum temperature for two different amounts of baresoil and the median amount of ROCK. Make a second plot with ROCK on the x - axis, 2 levels of SprTmax, and median baresoil.

Week 6 Lab

An aside

Back to the lab

Andrew Tyre

Professor of Wildlife Ecology