Week 4 Lab

We’re going to use the same dataset that you generated last week for this week’s exercise. Last week we looked at choosing a best model. This week we’re going to model average the coefficients and predictions from the models, and make “partial dependence plots”. As before, we’ll first do each step by hand, then with AICcmodavg. Below I’ve assumed you still have all the objects created in the Week 3 lab. You can ensure this by simply using the same RStudio project you created for the Week 3 lab. Then you need to include the code from Lab 3 in your Rmd file for Lab 4. There are three options, in increasing order of sophistication:

  1. Open yourname_lab3.Rmd (or whatever you called it). Save it as yourname_lab4.Rmd. Continue working at the end of the file, making one big document. This is entirely fine.

  2. Copy the code chunks out of your Lab 3 Rmd file into a code chunk at the top of a new Lab 4 Rmd file. You can leave out code that makes plots and tables; all you need is the code that assigns results to objects using <-. This option is tedious and subject to making a lot of mistakes.

  3. Include the following line in your first code chunk, before you run anything else.

source(knitr::purl("yourname_lab3.rmd", output = tempfile()))

The function knitr::purl() pulls all the code chunks out of the Rmd file and puts them in an R script; source() then runs that script. To be super clean, add include = FALSE, echo = FALSE to the code chunk options, which go at the start of the code chunk: change ```{r} to ```{r, include = FALSE, echo = FALSE}. That will keep the code and its results out of the output document.
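Putting those pieces together, the first chunk of your Lab 4 file could look like this (the chunk label lab3code is just an example name):

```{r lab3code, include = FALSE, echo = FALSE}
source(knitr::purl("yourname_lab3.rmd", output = tempfile()))
```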

As before, answer the numbered questions in the regular text section of your R markdown file.

Manually model average the intercept

First let’s look at the intercept. We need to get a vector of the estimated intercepts from each model:

library(broom)  # for tidy
library(tidyverse)  # for map_chr(), map_df(), filter(), mutate()

# old way: intercepts <- sapply(fits, function(mm) coef(mm)[1])
# new way: apply tidy() to each element of the list
intercepts <- map_df(fits, tidy) %>% filter(term == "(Intercept)") %>% 
    mutate(models = modnames)

The fitted models are stored in the list object fits, and we use the function tidy() to pull the estimates out into a data.frame. tidy() only works on a single model object, so we use map_df() to apply tidy() to each object in the list and pull all the results into one data.frame. Then we use filter() to keep just the rows with estimated intercepts. Explore this code a bit by using the individual functions and looking carefully at the returned results, including the new data.frame intercepts. Do not include these interactive explorations in your submitted R markdown file.
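For example (interactively, not in your submitted file), you could look at what tidy() returns for a single model, assuming fits still holds the fitted model objects from Week 3:

# what does tidy() give us for one model?
tidy(fits[[1]])
# and what does the combined, filtered data.frame look like?
head(intercepts)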

I left the “old way” of extracting these estimates in there for you to compare. Notice that the result is a single vector of the estimates, not a data.frame. Although the tidyverse approach looks more complex, with a couple of lines of code we’ve made several subsequent steps of the old way completely obsolete!

Now we need to calculate the weighted average of the intercepts, using the AIC weights. There are at least two ways to do this – but in the spirit of KISS we’ll use the brute force option:

mavg.intercept <- sum(weights * intercepts$estimate)

Here we’ve multiplied each element of the vector weights by the corresponding element of intercepts$estimate, then used the function sum() to add up the results.
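Since the AIC weights sum to 1, this is just a weighted mean, so as a quick check (assuming the weights vector from Week 3 is still in your workspace) you should get the same number from base R’s weighted.mean():

# same calculation as a sanity check
weighted.mean(intercepts$estimate, w = weights)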

  1. Compare the result to the estimated intercepts for the two top models, 12 and 16.

Next we need to calculate the variance of our model averaged estimate. Variances can be added together, so first square our standard errors to get the variance of each estimate.

intercepts <- intercepts %>% mutate(var_est = std.error^2)

Now we calculate a weighted average of the variances, but we also have to include a component that reflects how different the estimates are between models (see page 111 in MBILS). Vectorized calculations are our friend here:
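Written out, the unconditional variance that the next line computes is

$$\widehat{\mathrm{var}}\left(\hat{\bar{\beta}}_0\right) = \sum_{i=1}^{R} w_i \left[ \widehat{\mathrm{var}}\left(\hat{\beta}_{0,i}\right) + \left(\hat{\beta}_{0,i} - \hat{\bar{\beta}}_0\right)^2 \right],$$

where $w_i$ is the AIC weight of model $i$, $\hat{\beta}_{0,i}$ is that model’s intercept estimate, and $\hat{\bar{\beta}}_0$ is the model-averaged intercept.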

sum(weights * (intercepts$var_est + (intercepts$estimate - mavg.intercept)^2))

  1. Compare this “unconditional variance” with the value of the variance for the top two models.

Model averaging coefficients not in every model

The intercept was easy because it occurs in every model. What if we want to model average a parameter that does not occur in every model, such as Amount? There are two problems to solve: how to extract the coefficients from the models that have Amount, and what to do with the models Amount does not occur in. The first is actually pretty easy; we just change the filter() condition above. For convenience later on, I’m going to add the modnames to the list fits. Then map_df() will use the names of the list to create a variable in the result indicating which list item each row comes from.

names(fits) <- modnames
amount <- map_df(fits, tidy, .id = "modname") %>% filter(term == 
    "Amount")

If you look at the data.frame amount you can see that it has fewer rows than the number of models. That is because the covariate Amount only appears in some of the models. We have two choices about how to deal with this fact. The first option just uses the weights from the models with Amount in them, re-normalizing the weights to add to 1. The other option is to assume the models without Amount have an estimated coefficient of zero. We’ll try both.
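A quick interactive check of how many models the term actually shows up in (assuming fits still holds all of the Week 3 models):

# how many models include Amount?
nrow(amount)
length(fits)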

Renormalizing weights to 1

We have to do two things: match the weights up to the smaller dataframe, and then change them to sum to one. The first task is solved handily by using a join operation. I will first add the modnames column to the dataframe with all the AIC statistics, then join the two tables together.

# have to add modnames to aic.table too!
aic.table$modname <- modnames
# we want all the rows in Amount, and all columns in both
# that is a left join in database speak. see ?join in dplyr
# for all the options
amount <- left_join(amount, aic.table, by = "modname")
names(amount)

As you can see, we have now added the aic.table columns to the amount data.frame. Manually compare a couple of rows of amount with the values in aic.table to convince yourself this magic actually works. Now we renormalize the weights so they sum to 1:

sum(amount$weights)  # less than 1 but not by much
amount$weights <- amount$weights/sum(amount$weights)

and then do the model averaging calculations:

# use with() to make code shorter -- it looks for objects in
# the data.frame amount first
mavg.amount <- with(amount, sum(weights * estimate))
amount$var_est <- amount$std.error^2
mavg.var.amount <- with(amount, sum(weights * (var_est + (estimate - 
    mavg.amount)^2)))

  1. Compare the model averaged coefficient and its variance with the estimates from models 12 and 16. Does the model averaging change your conclusion about the effect of this variable?

Use all models, missing estimates set to 0

The other way to do model averaging, which I prefer, is to assume the estimate is zero for those models in which the parameter doesn’t appear. This “shrinks” the estimate towards zero, and by a greater amount if the models without Amount have substantial weight on them. The trick is to use a different join operation, one that “fills in” the rows missing from the left-hand table with missing values (NA).

# remake the Amount table
amount <- map_df(fits, tidy, .id = "modname") %>% filter(term == 
    "Amount")
amount_allrows <- full_join(amount, aic.table, by = "modname")

Notice that the last 8 rows have values from aic.table, but NA for all the columns that came from amount. We want to set the estimates and standard errors for those rows to zero.

# create a logical vector -- avoid relying on order
pick <- is.na(amount_allrows$estimate)
amount_allrows[pick, c("estimate", "std.error")] <- 0
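If you prefer to stay in the tidyverse, a roughly equivalent sketch uses replace_na() from tidyr (loaded as part of the tidyverse); it does the same thing as the two lines above:

# same idea: set missing estimates and standard errors to zero
amount_allrows <- amount_allrows %>% 
    mutate(estimate = replace_na(estimate, 0), 
           std.error = replace_na(std.error, 0))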

And now we can use exactly the same code to get the model averaged estimates.

mavg.amount2 <- with(amount_allrows, sum(weights * estimate))
amount_allrows$var_est <- amount_allrows$std.error^2
mavg.var.amount2 <- with(amount_allrows, sum(weights * (var_est + 
    (estimate - mavg.amount2)^2)))

  1. Compare those model averaged estimates with the ones from the conditional average. Again, does changing the method change your conclusion about the effect of Amount on the response?

  2. Recall from Week 3 that we actually know the true coefficients; Amount should be 2. Compare the top models and model averaged coefficients with the true value.

Model average all the things

Ideally we want to get a table with the model averaged coefficients for every parameter in the model. We could copy and paste the calculations above for each parameter, then bind them all together into a data.frame. Or we could do all the calculations at once. We start by simply not filtering when we create our table of parameter estimates.

# get all the things
estimates <- map_df(fits, tidy, .id = "modname")

Start with the version that conditions the weights on the models containing each parameter. First we need to get the weights associated with each model attached to the right rows. Back to left_join():

estimates1 <- left_join(estimates, aic.table[, c("weights", "modname")], 
    by = "modname")
head(estimates1)

We have multiple rows for each model, one for every parameter in the model. I restricted the columns of aic.table just for simplicity. Now we are going to repeat all the calculations above for each value of term, and summarize the result into a single data.frame.

estimates1 %>% 
    group_by(term) %>% 
    mutate(wnorm = weights/sum(weights), var_est = std.error^2) %>% 
    summarize(avg_est = sum(estimate * wnorm), 
              avg_var = sum(wnorm * (var_est + (estimate - avg_est)^2)))

Now that’s pretty nice! We did everything in about 7 lines of code.

Now we want to create a dataframe with the model averaged estimate, standard error, and a t-test comparing the estimate against the known true value and turn it into a nice output table using package pander.

# repeat code from above but save to object
coef_table1 <- estimates1 %>% 
    group_by(term) %>% 
    mutate(wnorm = weights/sum(weights), var_est = std.error^2) %>% 
    summarize(avg_est = sum(estimate * wnorm), 
              avg_var = sum(wnorm * (var_est + (estimate - avg_est)^2)))

coef_table1$avg_SE <- sqrt(coef_table1$avg_var)

# to get the t statistics we need a bit more information,
# what are the true parameter values
coef_table1$true_parms <- c(0, 2, -2, 2, 0)

coef_table1$t_value <- (coef_table1$avg_est - coef_table1$true_parms)/coef_table1$avg_SE

coef_table1$p_value <- 2 * (1 - pt(abs(coef_table1$t_value), 
    df = 350 - 5))

# print while removing the variance column
# pander(coef_table1[, -3])  # broken!
knitr::kable(coef_table1[, -3], digits = 2)

  1. Also make a table of the top model (hint: start with the output of tidy(fits[[12]])). Compare the conclusions you make about each effect in the model by using the top model or the model averaged coefficients.

Now let’s do it with the other option, assuming the estimates are zero when a parameter isn’t present in the model. We need to flesh out our dataframe estimates with the rows that are missing. A full join will do this, but first we need a table with all the coefficients and all the model names in it.

parms <- crossing(term = c("(Intercept)", "Amount", "Edge", "Hetero", 
    "MnPatch"), modname = modnames)
# if no by = argument, uses all columns with identical names
estimates2 <- full_join(estimates, parms)

Now we have a table with 5 parameters x 16 models = 80 rows. We can use nearly the same code as before, just skipping the normalization of the weights.
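A quick interactive sanity check that the join did what we expect:

nrow(estimates2)  # 5 parameters x 16 models = 80 rows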

# don't forget to add the model weights!
estimates2 <- left_join(estimates2, aic.table[, c("weights", 
    "modname")], by = "modname")

estimates2 %>% 
    group_by(term) %>% 
    mutate(estimate = if_else(is.na(estimate), 0, estimate), 
           std.error = if_else(is.na(std.error), 0, std.error), 
           var_est = std.error^2) %>% 
    summarize(avg_est = sum(estimate * weights), 
              avg_var = sum(weights * (var_est + (estimate - avg_est)^2)))

  1. Make the nice data.frame and put it in a pretty table. Compare the results with the other model averaging approach.

Now the easy way

All those calculations can be done for you by package AICcmodavg.

library(AICcmodavg)
# conditional on inclusion in a model
modavg(fits, "Amount", modnames = modnames)
# with all the models
modavgShrink(fits, "Amount", modnames = modnames)

So that was less code, but pulling it all out into a table for all parameters would be more painful than the dplyr-based manual code above, because you have to do each parameter individually.
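If you do want a table of all the parameters from AICcmodavg, something like the following sketch could work. I’m assuming here that the object returned by modavgShrink() has components named Mod.avg.beta and Uncond.SE; check str() on one result first to confirm the names in your version of the package.

# loop over the non-intercept parameters and collect the shrinkage estimates
parms_to_avg <- c("Amount", "Edge", "Hetero", "MnPatch")
shrink_table <- map_df(parms_to_avg, function(p) {
    ma <- modavgShrink(fits, p, modnames = modnames)
    tibble(term = p, avg_est = ma$Mod.avg.beta, avg_SE = ma$Uncond.SE)
})
knitr::kable(shrink_table, digits = 2)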

Finally, let’s make a plot. The design of the plot below was suggested by Shivani Jadeja a couple of years ago.

gg1 <- ggplot(smithsim, aes(x = Edge, y = Y, color = Amount)) + 
    geom_point()
# make prediction data
nd <- expand.grid(Edge = -3:3, Amount = quantile(smithsim$Amount, 
    p = c(0, 0.5, 1)), Hetero = median(smithsim$Hetero), MnPatch = median(smithsim$MnPatch))
nd <- augment(fits[[12]], newdata = nd)
gg1 + geom_line(aes(y = .fitted, group = Amount), data = nd)

The color shows the value of Amount for each point, a third dimension. The lines are the predictions from the top model, and they are automatically colored to match the points, so you can see which value of Amount gives you each prediction. The lines are parallel because there is no interaction term.
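If you want lines that reflect the model-averaged predictions rather than just the top model, AICcmodavg also has modavgPred(). A rough sketch follows; the names of the components in the returned object have changed between package versions, so look at str(pred) first and adjust the second line accordingly.

# model-averaged predictions for the same prediction data (sketch)
pred <- modavgPred(fits, modnames = modnames, newdata = nd)
nd$mavg_fit <- pred$mod.avg.pred  # component name is an assumption -- check str(pred)
gg1 + geom_line(aes(y = mavg_fit, group = Amount), data = nd, linetype = "dashed")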
