# Comparing an estimated coefficient with a constant

One of the exercises in Ecological Statistics fits a species area curve to data from the Galapogos Islands. I ask students to compare the estimated power coefficient $$z$$ with the value 0.27 reported in a meta-analysis by Drakare et al. 2006. But Drakare et al. only reported the mean, not the standard error.1

To do the comparison properly, I need to know how uncertain the average $$z$$ is. I looked through the paper carefully, but couldn’t find a standard error anywhere! The individual estimates used in the meta-analysis are in an appendix, but the standard errors are not reported there either. They do exist, because Drakare et al. discuss the value of a “weighted” average, where poor estimates (large SE) of $$z$$ get reduced weight. Oh well, I’ll do the best I can. I think if I ignore the variation between studies I’ll get an overestimate of the significance of the difference; inflated Type I error probability.

library("tidyverse")
library("NRES803") #devtools::install_github("atyre2/NRES803")
library("faraway") # for the data
drakare <- read_csv("https://drewtyre.rbind.io/post/data/Drakare_etal_2006_species-area-curves-review.csv")

The data come from Johnson and Raven (1973); I should check if any of the estimates in the Drakare data come from there.

grep("Raven", drakare\$Oref)
## [1] 194 196
drakare[193:197,"Oref"]
## # A tibble: 5 x 1
##   Oref
##   <chr>
## 1 Hulten E. (1960) Flora of the Aleutian Islands. Cramer, Weinheim.
## 2 Johnson M.P. & Raven P.H. (1973) Species number and endemism: the galapagos A~
## 3 Johnson M.P. & Simberloff D.S. (1974) Environmental determinants of island sp~
## 4 Johnson M.P., Mason L.G. & Raven P.H. (1968) Ecological parameters and plant ~
## 5 <NA>

OK, we need to remove row 194. I included rows before and after the hits because sometimes more than one estimate is taken from a single reference, and the references are missing for those extra rows.

drakare <- drakare[-194,]

OK. Now I’ll get a mean and standard error for all the studies.

drakare %>%
summarize(mean_z = mean(z_power),
se_z = sd(z_power)/sqrt(n()))
## # A tibble: 1 x 2
##   mean_z  se_z
##    <dbl> <dbl>
## 1     NA    NA

Oh right, there are some studies that didn’t report z for the power method. I could use na.rm = TRUE for both mean() and sd(), but then n() would be reporting the incorrect number.

drakare %>%
filter(!is.na(z_power)) %>%
summarize(mean_z = mean(z_power),
se_z = sd(z_power)/sqrt(n()),
n = n())
## # A tibble: 1 x 3
##   mean_z    se_z     n
##    <dbl>   <dbl> <int>
## 1  0.297 0.00936   784

Cooool. The SE is tiny because there are so many observations. Also, weighting the observations by their variance makes quite a difference – the weighted value reported is 0.27 compared to 0.30 unweighted.

OK, but before I get carried away, Drakare et al. did find differences by method (nested vs. independent) and “realm” (aquatic vs. terrestrial).

z_estimates <- drakare %>%
filter(!is.na(z_power)) %>%
group_by(method, Realm) %>%
summarize(mean_z = mean(z_power),
se_z = sd(z_power)/sqrt(n()),
n = n())
## summarise() has grouped output by 'method'. You can override using the .groups argument.
z_estimates
## # A tibble: 8 x 5
## # Groups:   method [4]
##   method Realm mean_z   se_z     n
##   <chr>  <chr>  <dbl>  <dbl> <int>
## 1 both   aqu    0.122 0.0218     5
## 2 both   ter    0.143 0.0203     6
## 3 ind    aqu    0.255 0.0201   121
## 4 ind    ter    0.281 0.0135   433
## 5 nes    aqu    0.285 0.0189    41
## 6 nes    ter    0.367 0.0184   127
## 7 <NA>   aqu    0.267 0.0490    12
## 8 <NA>   ter    0.442 0.0618    39

The Galapagos data is terrestrial and independent, because each observation is a distinct island. Now get the estimated coefficient of z for the Galapagos islands.

gala.1 <- lm(log(Species)~log(Area),data=gala)
summary(gala.1)
##
## Call:
## lm(formula = log(Species) ~ log(Area), data = gala)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -1.5442 -0.4001  0.0941  0.5449  1.3752
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   2.9037     0.1571  18.484  < 2e-16 ***
## log(Area)     0.3886     0.0416   9.342 4.23e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7842 on 28 degrees of freedom
## Multiple R-squared:  0.7571, Adjusted R-squared:  0.7484
## F-statistic: 87.27 on 1 and 28 DF,  p-value: 4.23e-10

Consulting wikipedia I believe I want

$t = \frac{\bar{X_1}-\bar{X_2}}{s_{\bar{\Delta}}}$

where

$s_{\bar{\Delta}} = \sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}$

I already have the things inside the square root; they are the square of the estimated standard errors. So …

gala.coefs <- tidy(gala.1)
s_combined <- sqrt(z_estimates[4, "se_z"]^2 + gala.coefs[2, "std.error"]^2)
s_combined
##         se_z
## 1 0.04372116

which is ever so slightly bigger than the se from the model alone.

t <- (gala.coefs[2, "estimate"] - z_estimates[4, "mean_z"])/s_combined
t
##   estimate
## 1 2.452262

OK, to get a p value for that, I need to get the degrees of freedom … it’s ugly.

(z_estimates[4, "se_z"]^2 + gala.coefs[2, "std.error"]^2)^2 /
(z_estimates[4, "se_z"]^4/433 + gala.coefs[2, "std.error"]^4/28)
##    se_z
## 1 34.15

That makes sense … it is a little bigger than the sample size for the larger variance.

# have to coax t into a naked number, many ways to do this
pt(t[1,1, drop = TRUE], df = 34.15, lower.tail = FALSE)
## [1] 0.009730172

Nice – so that is a test of the hypothesis that the estimated coefficient from Galapagos plants is the same as the unweighted average for all independently sampled terrestrial systems. The p value is the probability of getting an estimate as large as 0.39 or larger, given the null hypothesis was true. So the Galapagos Islands plant communities are not like the other things. I’m not sure what to make of that ecologically, but it’s cool.

1. Code not shown in the post can be found on Github.↩︎