One of the exercises in Ecological Statistics fits a species area curve to data from the Galapogos Islands. I ask students to compare the estimated power coefficient \(z\) with the value 0.27 reported in a meta-analysis by Drakare et al. 2006. But Drakare et al. only reported the mean, not the standard error.1
To do the comparison properly, I need to know how uncertain the average \(z\) is. I looked through the paper carefully, but couldn’t find a standard error anywhere! The individual estimates used in the meta-analysis are in an appendix, but the standard errors are not reported there either. They do exist, because Drakare et al. discuss the value of a “weighted” average, where poor estimates (large SE) of \(z\) get reduced weight. Oh well, I’ll do the best I can. I think if I ignore the variation between studies I’ll get an overestimate of the significance of the difference; inflated Type I error probability.
library("tidyverse")
library("NRES803") #devtools::install_github("atyre2/NRES803")
library("faraway") # for the data
drakare <- read_csv("https://drewtyre.rbind.io/post/data/Drakare_etal_2006_species-area-curves-review.csv")
The data come from Johnson and Raven (1973); I should check if any of the estimates in the Drakare data come from there.
grep("Raven", drakare$Oref)
## [1] 194 196
drakare[193:197,"Oref"]
## # A tibble: 5 x 1
## Oref
## <chr>
## 1 Hulten E. (1960) Flora of the Aleutian Islands. Cramer, Weinheim.
## 2 Johnson M.P. & Raven P.H. (1973) Species number and endemism: the galapagos A~
## 3 Johnson M.P. & Simberloff D.S. (1974) Environmental determinants of island sp~
## 4 Johnson M.P., Mason L.G. & Raven P.H. (1968) Ecological parameters and plant ~
## 5 <NA>
OK, we need to remove row 194. I included rows before and after the hits because sometimes more than one estimate is taken from a single reference, and the references are missing for those extra rows.
drakare <- drakare[-194,]
OK. Now I’ll get a mean and standard error for all the studies.
drakare %>%
summarize(mean_z = mean(z_power),
se_z = sd(z_power)/sqrt(n()))
## # A tibble: 1 x 2
## mean_z se_z
## <dbl> <dbl>
## 1 NA NA
Oh right, there are some studies that didn’t report z for the power method.
I could use na.rm = TRUE
for both mean()
and sd()
, but then
n()
would be reporting the incorrect number.
drakare %>%
filter(!is.na(z_power)) %>%
summarize(mean_z = mean(z_power),
se_z = sd(z_power)/sqrt(n()),
n = n())
## # A tibble: 1 x 3
## mean_z se_z n
## <dbl> <dbl> <int>
## 1 0.297 0.00936 784
Cooool. The SE is tiny because there are so many observations. Also, weighting the observations by their variance makes quite a difference – the weighted value reported is 0.27 compared to 0.30 unweighted.
OK, but before I get carried away, Drakare et al. did find differences by method (nested vs. independent) and “realm” (aquatic vs. terrestrial).
z_estimates <- drakare %>%
filter(!is.na(z_power)) %>%
group_by(method, Realm) %>%
summarize(mean_z = mean(z_power),
se_z = sd(z_power)/sqrt(n()),
n = n())
## `summarise()` has grouped output by 'method'. You can override using the `.groups` argument.
z_estimates
## # A tibble: 8 x 5
## # Groups: method [4]
## method Realm mean_z se_z n
## <chr> <chr> <dbl> <dbl> <int>
## 1 both aqu 0.122 0.0218 5
## 2 both ter 0.143 0.0203 6
## 3 ind aqu 0.255 0.0201 121
## 4 ind ter 0.281 0.0135 433
## 5 nes aqu 0.285 0.0189 41
## 6 nes ter 0.367 0.0184 127
## 7 <NA> aqu 0.267 0.0490 12
## 8 <NA> ter 0.442 0.0618 39
The Galapagos data is terrestrial and independent, because each observation is a distinct island. Now get the estimated coefficient of z for the Galapagos islands.
gala.1 <- lm(log(Species)~log(Area),data=gala)
summary(gala.1)
##
## Call:
## lm(formula = log(Species) ~ log(Area), data = gala)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.5442 -0.4001 0.0941 0.5449 1.3752
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.9037 0.1571 18.484 < 2e-16 ***
## log(Area) 0.3886 0.0416 9.342 4.23e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7842 on 28 degrees of freedom
## Multiple R-squared: 0.7571, Adjusted R-squared: 0.7484
## F-statistic: 87.27 on 1 and 28 DF, p-value: 4.23e-10
Consulting wikipedia I believe I want
\[ t = \frac{\bar{X_1}-\bar{X_2}}{s_{\bar{\Delta}}} \]
where
\[ s_{\bar{\Delta}} = \sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}} \]
I already have the things inside the square root; they are the square of the estimated standard errors. So …
gala.coefs <- tidy(gala.1)
s_combined <- sqrt(z_estimates[4, "se_z"]^2 + gala.coefs[2, "std.error"]^2)
s_combined
## se_z
## 1 0.04372116
which is ever so slightly bigger than the se from the model alone.
t <- (gala.coefs[2, "estimate"] - z_estimates[4, "mean_z"])/s_combined
t
## estimate
## 1 2.452262
OK, to get a p value for that, I need to get the degrees of freedom … it’s ugly.
(z_estimates[4, "se_z"]^2 + gala.coefs[2, "std.error"]^2)^2 /
(z_estimates[4, "se_z"]^4/433 + gala.coefs[2, "std.error"]^4/28)
## se_z
## 1 34.15
That makes sense … it is a little bigger than the sample size for the larger variance.
# have to coax t into a naked number, many ways to do this
pt(t[1,1, drop = TRUE], df = 34.15, lower.tail = FALSE)
## [1] 0.009730172
Nice – so that is a test of the hypothesis that the estimated coefficient from Galapagos plants is the same as the unweighted average for all independently sampled terrestrial systems. The p value is the probability of getting an estimate as large as 0.39 or larger, given the null hypothesis was true. So the Galapagos Islands plant communities are not like the other things. I’m not sure what to make of that ecologically, but it’s cool.