Here is an example of an exploratory analysis based on a re-examination of some count data on sea stars obtained before and after a wasting syndrome. The original analysis was conducted and published by Schultz et al. (2016). Here is the abstract from their paper.
Echinoderm population collapses, driven by disease outbreaks and
climatic events, may be important drivers of population dynamics,
ecological shifts and biodiversity. The northeast Pacific recently
experienced a mass mortality of sea stars. In Howe Sound, British
Columbia, the sunflower star Pycnopodia helianthoides—a
previously abundant predator of bottom-dwelling invertebrates—began to
show signs of a wasting syndrome in early September 2013, and dense
aggregations disappeared from many sites in a matter of weeks. Here, we
assess changes in subtidal community composition by comparing the
abundance of fish, invertebrates and macroalgae at 20 sites in Howe
Sound before and after the 2013 sea star mortality to evaluate evidence
for a trophic cascade. We observed changes in the abundance of several
species after the sea star mortality, most notably a four-fold increase
in the number of green sea urchins, Strongylocentrotus
droebachiensis, and a significant decline in kelp cover, which are
together consistent with a trophic cascade. Qualitative data on the
abundance of sunflower stars and green urchins from a citizen science
database show that the patterns of echinoderm abundance detected at our
study sites reflected wider local trends. The trophic cascade evident at
the scale of Howe Sound was observed at half of the study sites. It
remains unclear whether the urchin response was triggered directly, via
a reduction in urchin mortality, or indirectly, via a shift in urchin
distribution into areas previously occupied by the predatory sea stars.
Understanding the ecological implications of sudden and extreme
population declines may further elucidate the role of echinoderms in
temperate seas, and provide insight into the resilience of marine
ecosystems to biological disturbances.
Main results
Here is Figure 3 from Schultz et al. (2016), which shows the mean counts of organisms before and after the wasting event. They used a barplot with symmetric error bars based on an (in my opinion) flawed model for counts: because the error bars are symmetric around the means, they imply that negative counts are possible.
Figure 3. Mean abundance (per m²) of (A) sunflower stars and (B) green sea urchins, and (C) percent cover of kelp on rocky reefs in Howe Sound, British Columbia, on 80 transects before and after the mass mortality of sea stars in 2013. Error bars represent standard error. The dominant kelp was the sea colander kelp, Agarum fimbriatum.
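To see why symmetric error bars are problematic for counts near zero, consider a quick numerical sketch (the counts below are hypothetical, not the paper’s data):
## hypothetical transect counts clustered near zero
x <- c(0, 0, 1, 0, 2, 0, 0, 1)
## standard error of the mean
se <- sd(x) / sqrt(length(x))
## a Gaussian-style 95% interval around the mean
mean(x) + c(-1.96, 1.96) * se
The lower bound is negative, even though a count can never be less than zero.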
Models for count data
The before/after comparison of organism densities in Schultz et al. is based upon discrete counts of individuals. Thus, we should really consider statistical distributions that reflect the true nature of the data, rather than transforming the counts and analyzing them with a continuous distribution (e.g., Gaussian).
Poisson distribution
The Poisson distribution is one of the most common ways to model count data because it has only one parameter to estimate (i.e., the mean and the variance are equal). If we expect, on average, \(\lambda_t\) individuals m⁻² at time \(t\), and we sample a total of \(A\) m², then each count \(i\) during that period, \(c_{i,t}\), would reflect the following Poisson process:
\[
c_{i,t} \sim \text{Poisson}(\lambda_t A).
\]
Because the rate parameter \(\lambda_t\) must be greater than zero, one generally assumes a log link whereby
\[
\log(\lambda_t) \sim \text{Normal}(\mu, \sigma^2).
\]
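As a quick sanity check on this formulation, we can simulate counts at a known density and recover \(\lambda\) with a Poisson GLM that uses \(\log(A)\) as an offset. This is just a sketch with made-up values for the density and survey area:
## simulate Poisson counts at a known (hypothetical) density
set.seed(123)
lambda_true <- 0.4                      ## hypothetical density (m^-2)
A <- 3.75                               ## hypothetical survey area (m^2)
cc <- rpois(80, lambda_true * A)
## Poisson GLM with a log link and log-area offset
fit <- glm(cc ~ 1, family = poisson(link = "log"),
           offset = rep(log(A), length(cc)))
## back-transform the intercept; should be close to 0.4
exp(coef(fit))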
Negative binomial distribution
An alternative to the Poisson is the negative binomial distribution, which has an additional parameter to account for overdispersion in the data (i.e., when the variance exceeds the mean). Specifically,
\[
c_{i,t} \sim \text{NegBin}(p, r).
\]
The shape (dispersion) parameter \(r\) must be greater than zero and can be modeled with a uniform distribution, such that
\[
r \sim \text{Uniform}(0, U),
\]
where \(U\) is some suitably large upper bound.
The probability parameter \(p\) can
be written in terms of \(r\) and the
underlying mean \(\lambda_t A\),
whereby
\[
p = \frac{r}{r + \lambda_t A}
\]
and \(\log(\lambda_t) \sim
\text{Normal}(\mu, \sigma^2)\) as with the Poisson distribution.
Of note, the variance of the negative binomial distribution can be calculated as
\[
\text{Var}(c_{i,t}) = \lambda_t A + \frac{(\lambda_t A)^2}{r},
\]
such that the variance converges to the mean as \(r \rightarrow \infty\), and the negative binomial converges to the Poisson.
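We can check this mean-variance relationship by simulation. Here is a minimal sketch using the mu/size parameterization of R’s built-in rnbinom(), with arbitrary values for the mean and \(r\):
## check the negative binomial mean-variance relationship
set.seed(123)
mu <- 1.5                  ## hypothetical mean (lambda_t * A)
r <- 2                     ## hypothetical dispersion parameter
cc <- rnbinom(1e6, size = r, mu = mu)
mean(cc)                   ## should be close to 1.5
var(cc)                    ## should be close to mu + mu^2 / r = 2.625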
Aside on equations
The above equations are written in LaTeX, which can be a bit tricky to figure out when you’re first learning. There is a neat package called {equatiomatic}, which will take a fitted model object in R and translate it into the corresponding LaTeX code. For example, this code fits a simple linear regression model to the mtcars dataset and then writes out the equation.
library(equatiomatic)
## Fit a simple linear regression model
model_fit <- lm(mpg ~ cyl, mtcars)
## Pass the model to extract_eq
extract_eq(model_fit)
## raw result
## $$
## \operatorname{mpg} = \alpha + \beta_{1}(\operatorname{cyl}) + \epsilon
## $$
\[
\operatorname{mpg} = \alpha + \beta_{1}(\operatorname{cyl}) + \epsilon
\]
Requirements
I rely on a number of packages for reading, munging, and plotting the data. In addition to some frequentist approaches, I also show options for Bayesian model fitting using JAGS via the {R2jags} package (which requires a standalone installation of JAGS itself). All of these packages are loaded below.
## for reading & writing data
library(here)
library(readxl)
library(readr)
## for data munging
library(dplyr)
library(tidyr)
## for plotting
library(ggplot2)
library(patchwork)
## for model fitting
library(R2jags)
Data
Schultz et al. were very considerate in posting the data they used
for the analyses and figures in their paper. Those data are available in
MS Excel format from PeerJ’s server in a file called Schultz_data_ver2.xlsx.
Step 1: Convert Excel workbook to csv files
The code provided by the authors works with various data files in .csv format, so I begin by extracting those from the .xlsx file.
## data directory
data_dir <- here("lectures", "week_06", "data")
## original data file
orig_data <- here(data_dir, "Schultz_data_ver2.xlsx")
## worksheets in notebook
sht_names <- excel_sheets(orig_data)
## convert worksheets to csv, if not done already
## (the sheet names already include the .csv extension)
if(length(list.files(data_dir, "csv")) == 0) {
  for(i in sht_names) {
    orig_data %>%
      read_xlsx(sheet = i) %>%
      write_csv(file = here(data_dir, i))
  }
}
Step 2: Load count data
The data arise from a series of samples before and after the sea star wasting event, with counts of various species obtained within 0.25 m² quadrats at 15 locations along each of 4 transects at 20 different sites. The counts reported by the authors have been summed across all 15 quadrats for each transect/site combination, so the data frame has a total of (2 periods) x (4 transects) x (20 sites) = 160 rows.
## read count data
counts <- read_csv(here(data_dir, "transectcounts.csv"))
colnames(counts)
## [1] "ssws" "site" "transect"
## [4] "california.cucumber" "dungeness.crab" "green.urchin"
## [7] "grunt.sculpin" "longfin.sculpin" "misc.crab"
## [10] "octopus" "orange.cup.coral" "red.rock.crab"
## [13] "red.urchin" "sailfin.sculpin" "scalyhead.sculpin"
## [16] "sculpin" "shrimp" "spot.prawn"
## [19] "squat.lobster" "sunflower.star" "white.urchin"
## split out before/after counts of sunflower stars
## wide format (spread() sorts the new columns alphabetically,
## so column 1 = after and column 2 = before)
stars_wide <- counts %>%
  select(ssws, transect, sunflower.star) %>%
  spread(ssws, value = sunflower.star) %>%
  select(-transect)
## tidy format
stars_tidy <- stars_wide %>%
  gather(key = time, value = count)
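As an aside, spread() and gather() still work but have been superseded in {tidyr}. An equivalent reshaping with pivot_wider() and pivot_longer() would look like the sketch below; note that names_sort = TRUE is needed to reproduce spread()’s alphabetical column order, which the JAGS code below relies on.
## equivalent reshaping with the newer tidyr verbs
stars_wide <- counts %>%
  select(ssws, transect, sunflower.star) %>%
  pivot_wider(names_from = ssws, values_from = sunflower.star,
              names_sort = TRUE) %>%
  select(-transect)
stars_tidy <- stars_wide %>%
  pivot_longer(everything(), names_to = "time", values_to = "count")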
Step 3: Plot the data
These data contain a lot of zeros, so visualization can be a bit
tricky. Here are two options: a jittered dot plot (left) and violin plot
(right). These are rough exploratory figures.
## base plot
pp <- ggplot(stars_tidy, aes(x = time, y = count, color = time))
## jittered dotplot
p1 <- pp +
  geom_jitter(shape = 16, position = position_jitter(0.3), size = 2) +
  theme_bw()
## violin plot
p2 <- pp +
  geom_violin(size = 1.5) +
  theme_bw()
## combine plots
p1 + p2 &
  theme(legend.position = "none") &
  theme(text = element_text(size = 20)) &
  scale_x_discrete(limits = c("before", "after")) &
  labs(x = "", y = "Count")
Analyses
Poisson models
We need to define the total survey area per transect, so we can use it as an offset in our statistical models.
## total survey area in m^2 (15 quadrats x 0.25 m^2 each)
area <- 3.75
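As a quick frequentist point of reference (my addition, not part of the original analysis), we can fit a Poisson GLM to the tidy counts with \(\log(\text{area})\) as an offset; the back-transformed coefficients are the estimated densities for each period.
## frequentist check: Poisson GLM with a log-area offset
po_fit_glm <- glm(count ~ time - 1, family = poisson(link = "log"),
                  offset = rep(log(area), nrow(stars_tidy)),
                  data = stars_tidy)
## back-transform to densities (indiv m^-2) for each period
exp(coef(po_fit_glm))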
Here’s the JAGS code for fitting a Poisson
distribution to the data.
## define Poisson model in JAGS
cat("
data {
  N <- dim(stars_wide);
}
model {
  ## PRIORS
  ## note: JAGS parameterizes the normal by its precision (1/variance)
  ln_lambda_bef ~ dnorm(0, 0.01);
  ln_lambda_aft ~ dnorm(0, 0.01);
  ## DERIVED PARAMS
  lambda_bef <- exp(ln_lambda_bef);
  lambda_aft <- exp(ln_lambda_aft);
  ## LIKELIHOOD
  ## column 1 = after, column 2 = before (see Step 2 above)
  for(i in 1:N[1]) {
    stars_wide[i,1] ~ dpois(lambda_aft * area);
    stars_wide[i,2] ~ dpois(lambda_bef * area);
  }
}
", file = "poisson_glm.txt") ## end model description
## data to pass to JAGS
dat_jags <- c("stars_wide", "area")
## model params for JAGS to return
par_jags <- c("lambda_bef", "lambda_aft")
## MCMC control params
mcmc_chains <- 4
mcmc_length <- 2e4
mcmc_burn <- 1e4
mcmc_thin <- 20
## total number of MCMC samples
mcmc_samp <- (mcmc_length - mcmc_burn) * mcmc_chains / mcmc_thin
## list of model info for JAGS
mod_jags <- list(data = dat_jags,
                 model.file = "poisson_glm.txt",
                 parameters.to.save = par_jags,
                 n.chains = as.integer(mcmc_chains),
                 n.iter = as.integer(mcmc_length),
                 n.burnin = as.integer(mcmc_burn),
                 n.thin = as.integer(mcmc_thin))
## fit model
po_fit_jags <- do.call(jags.parallel, mod_jags)
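Before interpreting the results, it is worth confirming that the chains converged. A quick option is the traceplot() method that {R2jags} provides for fitted objects (the Rhat values in the summary below should also be close to 1).
## visual convergence check: trace plots for the saved parameters
traceplot(po_fit_jags, varname = par_jags, mfrow = c(1, 2), ask = FALSE)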
Examine parameter estimates
The first thing we can do is examine a summary table of the posterior
samples.
print(po_fit_jags)
## Inference for Bugs model at "poisson_glm.txt", fit using jags,
## 4 chains, each with 20000 iterations (first 10000 discarded), n.thin = 20
## n.sims = 2000 iterations saved
## mu.vect sd.vect 2.5% 25% 50% 75% 97.5% Rhat n.eff
## lambda_aft 0.057 0.014 0.034 0.047 0.056 0.065 0.086 1.003 910
## lambda_bef 0.423 0.039 0.354 0.397 0.421 0.447 0.505 1.001 2000
## deviance 710.207 2.080 708.207 708.737 709.557 711.059 715.769 1.008 2000
##
## For each parameter, n.eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor (at convergence, Rhat=1).
##
## DIC info (using the rule, pD = var(deviance)/2)
## pD = 2.2 and DIC = 712.4
## DIC is an estimate of expected predictive error (lower deviance is better).
Visual summaries of the results can also be useful. Here are
histograms of the posterior samples (left) and a summary of the median
and 95% credible interval of the posterior samples (right).
## gather posterior samples
pdat <- data.frame(Time = rep(c("before", "after"), each = mcmc_samp),
                   samples = c(po_fit_jags$BUGSoutput$sims.list$lambda_bef,
                               po_fit_jags$BUGSoutput$sims.list$lambda_aft))
## summary of posteriors
pdat2 <- pdat %>%
  group_by(Time) %>%
  summarize(lo = quantile(samples, 0.025),
            med = quantile(samples, 0.5),
            hi = quantile(samples, 0.975))
## histogram of posteriors
p1 <- ggplot(pdat, aes(samples, fill = Time)) +
  theme_bw() +
  geom_histogram(bins = 100) +
  guides(fill = guide_legend(reverse = TRUE)) +
  labs(x = expression(Density~(m^-2)), y = "Count")
## medians and 95% credible intervals
p2 <- ggplot(pdat2, aes(x = Time, y = med, color = Time)) +
  theme_bw() +
  geom_point(size = 1.5) +
  geom_errorbar(aes(ymin = lo, ymax = hi), width = 0, size = 1.5) +
  scale_x_discrete(limits = c("before", "after")) +
  labs(x = "", y = expression(Density~(m^-2))) +
  theme(legend.position = "none")
p1 + p2 & theme(text = element_text(size = 20))
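One advantage of having the full posterior samples is that derived quantities are trivial to compute; for example, the posterior probability that density was lower after the wasting event:
## Pr(density after < density before)
sims <- po_fit_jags$BUGSoutput$sims.list
mean(sims$lambda_aft < sims$lambda_bef)
Given the essentially non-overlapping posteriors above, this probability is very close to 1.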
Negative binomial models
Here is the JAGS code for fitting a negative
binomial model to the data.
## define negative binomial model in JAGS
cat("
data {
  N <- dim(stars_wide);
}
model {
  ## PRIORS
  ## half-normal priors for r (rather than the uniform described above)
  r_bef ~ dnorm(0, 0.01) T(0,);
  r_aft ~ dnorm(0, 0.01) T(0,);
  ## half-normal priors for the densities (rather than a log-normal)
  lambda_bef ~ dnorm(0, 0.1) T(0,);
  lambda_aft ~ dnorm(0, 0.1) T(0,);
  ## DERIVED PARAMS
  mu_bef <- lambda_bef * area;
  mu_aft <- lambda_aft * area;
  p_aft <- r_aft / (r_aft + mu_aft);
  p_bef <- r_bef / (r_bef + mu_bef);
  mean_bef <- r_bef * (1 - p_bef) / p_bef;
  mean_aft <- r_aft * (1 - p_aft) / p_aft;
  var_bef <- mean_bef / p_bef;
  var_aft <- mean_aft / p_aft;
  ## LIKELIHOOD
  ## column 1 = after, column 2 = before (see Step 2 above)
  for(i in 1:N[1]) {
    stars_wide[i,1] ~ dnegbin(p_aft, r_aft);
    stars_wide[i,2] ~ dnegbin(p_bef, r_bef);
  }
}
", file = "negbin_glm.txt") ## end model description
## update model params for JAGS to return
par_jags <- c("lambda_bef", "mean_bef", "var_bef", "lambda_aft", "mean_aft", "var_aft")
## update list of model info for JAGS
mod_jags$model.file <- "negbin_glm.txt"
mod_jags$parameters.to.save <- par_jags
## fit model
nb_fit_jags <- do.call(jags.parallel, mod_jags)
Examine parameter estimates
Again we can examine a summary table of the parameter estimates. Notice that in this case, we also have estimates for the mean and variance of the number of stars per transect (i.e., per 3.75 m²).
print(nb_fit_jags)
## Inference for Bugs model at "negbin_glm.txt", fit using jags,
## 4 chains, each with 20000 iterations (first 10000 discarded), n.thin = 20
## n.sims = 2000 iterations saved
## mu.vect sd.vect 2.5% 25% 50% 75% 97.5% Rhat
## lambda_aft 0.778 1.182 0.034 0.104 0.267 0.910 4.350 1.001
## lambda_bef 0.484 0.139 0.285 0.390 0.460 0.549 0.814 1.002
## mean_aft 2.919 4.431 0.127 0.391 1.001 3.414 16.314 1.001
## mean_bef 1.815 0.523 1.068 1.461 1.724 2.058 3.052 1.002
## var_aft 3482.222 13351.124 0.649 6.640 57.965 978.146 35273.228 1.001
## var_bef 18.659 16.543 5.942 10.413 14.665 21.957 53.184 1.001
## deviance 287.540 3.355 282.621 285.032 287.145 289.375 295.609 1.007
## n.eff
## lambda_aft 2000
## lambda_bef 1100
## mean_aft 2000
## mean_bef 1100
## var_aft 2000
## var_bef 2000
## deviance 490
##
## For each parameter, n.eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor (at convergence, Rhat=1).
##
## DIC info (using the rule, pD = var(deviance)/2)
## pD = 5.6 and DIC = 293.1
## DIC is an estimate of expected predictive error (lower deviance is better).
The DIC for the negative binomial model (293.1) is much lower than that for the Poisson (712.4), which suggests the counts are indeed overdispersed. This time I use frequency polygons instead of histograms because the overlap in the distributions makes them hard to discern otherwise.
## gather posterior samples
pdat <- data.frame(Time = rep(c("before", "after"), each = mcmc_samp),
                   samples = c(nb_fit_jags$BUGSoutput$sims.list$lambda_bef,
                               nb_fit_jags$BUGSoutput$sims.list$lambda_aft))
## summary of posteriors
pdat2 <- pdat %>%
  group_by(Time) %>%
  summarize(lo = quantile(samples, 0.025),
            med = quantile(samples, 0.5),
            hi = quantile(samples, 0.975))
## trim away very large values before plotting
pdat <- pdat %>%
  filter(samples <= 2)
## frequency polygons of posteriors
p1 <- ggplot(pdat, aes(samples, color = Time)) +
  theme_bw() +
  geom_freqpoly(bins = 50, size = 1.5) +
  guides(color = guide_legend(reverse = TRUE)) +
  labs(x = expression(Density~(m^-2)), y = "Count")
## medians and 95% credible intervals
p2 <- ggplot(pdat2, aes(x = Time, y = med, color = Time)) +
  theme_bw() +
  geom_point(size = 1.5) +
  geom_errorbar(aes(ymin = lo, ymax = hi), width = 0, size = 1.5) +
  scale_x_discrete(limits = c("before", "after")) +
  labs(x = "", y = expression(Density~(m^-2))) +
  theme(legend.position = "none")
p1 + p2 & theme(text = element_text(size = 20))
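The same derived-quantity trick works here; even with the much wider posteriors under the negative binomial, we can compute the posterior probability of a decline directly from the samples:
## Pr(density after < density before) under the negative binomial model
nb_sims <- nb_fit_jags$BUGSoutput$sims.list
mean(nb_sims$lambda_aft < nb_sims$lambda_bef)
Unlike the Poisson fit, the posteriors overlap considerably here, so this probability will be noticeably less than 1.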