Chapter 6 How does dietary diversity affect populations?
6.1 About this Workflow Demonstration
This Workflow Demonstation (WD) is primarily a supplement to the material in the book Insights from Data with R by Petchey, Beckerman, Childs, and Cooper. If you don’t understand something here, have a look at that book (perhaps again), or search for help, or get in touch with us.
At the time of publication (early 2021) there are likely some improvements that could be made to this demonstration, in terms of the description of what we do in it, why we do it, and how we do it. If anything seems odd, unclear, or sub-optimal, please don’t hesitate to get in touch, and we will quickly make improvements.
6.2 Going to the next level
This case study includes a little bit more complexity than the first one (bat diets) and covers a few new topics in R and in general. The complexity mainly occurs in the calculations we do to obtain the response variables (i.e. the variables we examine to answer our biological question). The calculations involve some conceptually challenging manipulations, and the corresponding pipelines of dplyr functions will at first seem complex and daunting.
It’s fine to be concerned, even a bit stressed, by this complexity. If you feel so, just take a break, have a chat with friends, and while doing that think about how we only travel great distances and climb high peaks by taking one small step at a time. The same applies to what seem like complex and daunting tasks in getting insights from data. It’s amazing how far we get taking one step at a time. When we show you what seems like a complex task, like a complex pipeline of functions, break it down into its component steps. Let’s take a first step by introducing the case study and data that was collected.
Again: as you work through this chapter, try following along with the workflow/checklist here.
6.3 Introduction to the study and data
The data comes from a study of how the number of available prey species for a predator consume affects the stability of the predator population dynamics. The study was designed to address the hypotheses that more pathways of energy flow to a predator (i.e. more prey species) would stabilise the predator population dynamics, 1) by increasing the abundance of the predator, 2) by reducing how much it varies, and 3) by delaying extinction.
The experiment involved three prey species and one predator. One, two or three prey species were present, with every combination of each. The experimental units were small jars of nutrient containing liquid and the predator and prey species were aquatic microbes feeding on bacteria also present in the liquid. The small size of the prey and predators (about 20 to 400 microns) means that in just a few weeks individuals can reproduce many times, and also die. There were five replicates of each of the seven prey species compositions (so 35 experimental units).
During the experiment the data was written onto paper, then at a later date the data was entered into a spreadsheet. The data in the spreadsheet was checked for errors, by comparing it with the original paper copy of the data. The data recording and entry was done with tidiness and cleanliness in mind, so tidying and cleaning will be minimal. The data file contains the data exactly as it was recorded on the paper data sheet.
Looking at the dataset in Excel or something similar, you will see the following variables:
- day - the day of the experiment the observation in the row belongs to.
- count - the number of individuals counted in the sample; used to calculate density per ml. Counts were made by eye, looking down a microscope.
- w1 - the volume of liquid in ml removed from the community for sampling abundance. Used to calculate density per ml.
- w2 - the volume of diluent added to the volume sampled, plus the volume sampled. Used to calculate density per ml.
- w3 - the volume of the sub-sample taken of the mixture of diluent and sample removed. Used to calculate density per ml.
- species - a numeric code for the species that the observation in the row belongs to.
- prey.richness - the number of prey species added to the jar.
- prey.composition - a number that identifies the combination of prey species added to the jar.
- replicate - a replicate identifier, unique within prey composition.
- jar - a number that uniquely identifies each experimental unit.
How does the data measure up to the features of data we mentioned in the first chapter?
- Number of variables. Not too many. Really there are only two or three explanatory variables, and one response variable (population size, though we will derive three separate response variables). Certainly we are not up in the range of tens of explanatory variables. And because this was an experiment, we only have variables already thought to be important.
- Number of observations. Not too many. Just 35. Our ability to detect patterns is going to be low.
- If and how many of the variables describe experimental manipulations. Yes the data are from an experiment. Prey composition was manipulated, and so therefore was prey richness. So we know that any apparent relationships between the predator population and the prey composition is a result of a causative relationships. I.e. if we alter prey composition and only prey composition (as is the case here) and see changes in the predator population then we know that prey composition caused those changes.
- The amount of correlation among the variables. There is no correlation to speak of; this is not so uncommon in an experiment.
- How independent are the observations. Here we need to be a bit careful. There were 35 experimental units, so there should be 35 data points in our analyses. Multiple samples through time were taken from each experimental unit, however—we have repeated measures. We will take care of this, and not fall into a trap of trying to interpret patterns in graphs with lots of non-independent data points.
6.4 What type of response variable?
This question is worth asking at the beginning of a study. There were, in fact, three main response variables used in this study and these correspond to the three hypotheses mentioned above. I.e. there is one response variable for each of the three hypotheses/questions. These response variables were: 1) the population size of the predator population, 2) the temporal variability of the predator population size, and 3) the amount of time for which the predator population persisted. Each of these is an component/influence on the stability of the predator population. Each will have to be calculated from the time series of predator population density (number of individuals predators per ml) in each replicate. First though we will have to calculate population density from count
, w1
, w2
, and w3
, because during the experiment the volume of liquid in which the number of individuals of each species was counted varied (i.e. w1
, w2
, and w3
are not constant). The reason for this variation is that dilution is required when populations are very dense, otherwise the dense swarm of individuals would be near impossible to count. But first…
6.5 A little preparation
Make a new folder for this case study, and in it a folder called data
(this will contain any datasets for this case study/project). In RStudio create an RStudio project and save it in the project folder (not the data
folder). Close RStudio and double click on the project file to re-open RStudio. Create a new R script file and load the libraries we will use:
# Load the libraries we use
library(dplyr)
library(ggplot2)
library(readr)
library(stringr)
library(lubridate)
library(tidyr)
library(forcats)
Install the forcats
library if you have not already done so, otherwise library(forcats)
will not work. Look at the Packages section of the Getting Acquainted chapter if you need help with installation.
6.6 Acquire the dataset
Get the data file dileptus expt data.csv
from the dryad repository associated with the publication about the data (make sure you get version 2 of the dataset). The dataset is stored on dryad in comma separated value (csv) format. Put the file in your data
folder for this project.
6.7 Import the dataset
Import the dataset using the read_csv()
function from the readr
package.
Make sure you opened RStudio by clicking on your project file for this case study (or switch to the RProject in the top right of the RStudio window. This will ensure RStudio is looking at the correct folder, and that the relative path
used before the file name will be correct:
<- read_csv("data/dileptus_predator_prey_data.csv") stab
##
## ── Column specification ───────────────────────────────
## cols(
## day = col_double(),
## count = col_character(),
## w1 = col_double(),
## w2 = col_double(),
## w3 = col_double(),
## species = col_double(),
## prey.richness = col_double(),
## prey.composition = col_double(),
## replicate = col_double(),
## jar = col_double()
## )
Great, we didn’t get an error. The read_csv()
function has done two things. It read in the data and the stab <-
assigns the data to a new object called stab
(you could give it a name other than stab
if you wish).
The read_csv()
function also gives a summary of what it found.
6.8 Checking the import worked correctly
Now we should check some basic things about the imported data. Let’s start by checking if it has the correct number of rows and variables, and if variables are the expected (i.e. correct type). We can look at these features of the imported data in many ways. One useful method is to type the name of the data object in the Console and press enter:
stab
## # A tibble: 806 x 10
## day count w1 w2 w3 species prey.richness
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2 52 0.356 6.32 0.322 1 1
## 2 2 0 0.356 1 1 4 1
## 3 2 66 0.360 6.56 0.323 1 1
## 4 2 0 0.360 1 1 4 1
## 5 2 21 0.303 5.67 0.0843 2 1
## # … with 801 more rows, and 3 more variables:
## # prey.composition <dbl>, replicate <dbl>, jar <dbl>
We are told the object is A tibble: 806 x 10
. This is what we expect if we look at the datafile. All looks good so far then.
Not all is well, however, as some of the variables that should be numeric are character, and vice versa. We expect count
to be numeric. Looking that the count
variable we see some entries are a full stop. This is because the dataset was originally made to be used by software (SAS) which by default uses the full stop to indicate a missing value. So we need to repeat the import the a full stop as the missing value indicator:
<- read_csv("data/dileptus_predator_prey_data.csv", na = ".") stab
##
## ── Column specification ───────────────────────────────
## cols(
## day = col_double(),
## count = col_double(),
## w1 = col_double(),
## w2 = col_double(),
## w3 = col_double(),
## species = col_double(),
## prey.richness = col_double(),
## prey.composition = col_double(),
## replicate = col_double(),
## jar = col_double()
## )
Good, now count
is a numeric type variable.
6.9 Cleaning and tidying
6.9.1 Recode some names
We expect the species
and the prey.composition
variables should probably be character variables. Indeed, they really should be characters, but they appear to be integers. Always use the proper word when you have one —don’t ever use codes like this (the person who entered this data was probably not thinking about this carefully enough! Or was constrained by time. Or both.). Let’s fix this by replacing the species code numbers with their names (we found the correct names for each code in the lab book for this experiment; they are not in the dataset):
<- stab %>%
stab mutate(species = recode(species,
"1" = "Colpidium",
"2" = "Collodictyon",
"3" = "Paramecium",
"4" = "Dileptus"))
Here we do a mutate
to make a new version of the species
variable, and the recode
function to change the numbers to the species names. We first give recode
the variable to work on (species
) and then tell it to change each 1 to “Colpidium” and so on for the other numbers.
Next we work on the prey.composition
variable. In fact, we’re just going to throw it away as it is redundant, because we can construct it from the species variable. Let’s do that. While we’re at it, let’s also get the prey richness from the other columns, rather than use the existing imported prey.richness
column.
The first thing we’ll do is make a new dataset containing these two new variables, community_composition
and prey_richness
, and we’ll get these variables for each prey composition and replicate combination:
<- stab %>%
comm_comp group_by(prey.composition, replicate) %>%
summarise(community_composition = paste(unique(species), collapse = ", "),
prey_richness = length(unique(species)) - 1)
## `summarise()` has grouped output by 'prey.composition'. You can override using the `.groups` argument.
Let’s break this down. We’re doing a group_by
—summarise
. The group_by
is by prey.composition
(seven compositions) and replicate
(five replicates), so we’re expecting 7 * 5 = 35 rows in this new dataset. Then in the summarise
we create in the first line the new community_composition
variable, and in the second the new prey_richness
variable. To make the new community_composition
variable we paste together the species present in the group (composition by replicate combination); we use unique
to remove repeats of the species names, and collapse
to make the species names in a group be pasted together into one string. Take a look at the new community_composition
variable, and then look again at the code above and the description of how we just made this new prey_composition
variable.
%>%
comm_comp select(community_composition)
## Adding missing grouping variables: `prey.composition`
## # A tibble: 35 x 2
## prey.composition community_composition
## <dbl> <chr>
## 1 1 Colpidium, Dileptus
## 2 1 Colpidium, Dileptus
## 3 1 Colpidium, Dileptus
## 4 1 Colpidium, Dileptus
## 5 1 Colpidium, Dileptus
## # … with 30 more rows
We also make the new prey_richness
variable. The code was prey_richness = length(unique(species)) - 1
. The length
function tells us the number of unique (because we used the unique
function) species, and we subtract the 1 to remove the 1 added by the predator species being present.
Next we do more or less the same thing to get a new variable prey_composition
. This will be the same as the community composition variable we just made, but without the predator species name included. The code is more or less the same, but with the predator (Dileptus) rows filtered out before the summarise
.
<- stab %>%
prey_comp group_by(prey.composition, replicate) %>%
filter(species != "Dileptus") %>%
summarise(prey_composition = paste(unique(species), collapse = ", "))
## `summarise()` has grouped output by 'prey.composition'. You can override using the `.groups` argument.
The final thing to do is to add the new variables to the original dataset. We do this by merging (joining) the datasets with the new variables into the original dataset, one then the other, then we clean a bit by removing the old prey.composition
and prey.richness
variables, and by removing the comm_comp
and prey_comp
datasets which were only of temporary use.
<- full_join(stab, comm_comp) %>%
stab full_join(prey_comp) %>%
select(-prey.composition, -prey.richness) %>%
ungroup()
## Joining, by = c("prey.composition", "replicate")
## Joining, by = c("prey.composition", "replicate")
rm(comm_comp, prey_comp)
It’s not necessary to remove variables and datasets that we’re not interested in using anymore, but is often a good idea to do so. It can prevent us accidentally using them, or just from getting confused about what they are and if they are in fact important when we come back to our work after a gap of a few months, as often happens. If we have a rather large dataset, keeping only the essential variables can usefully reduce the size of the dataset.
6.9.2 Make the prey_composition
variable a factor with specific order
What on earth did that section title mean? So far we have not mentioned factors. These are a type of variable that contains words (a bit like character variables) but in which the different words/categories are referred to as levels and have an explicit order. This order affects how some things are done in R, and this is why we want to determine the specific order of the levels of the prey_composition
variable. For example, when we make a graph with prey_composition
on the x axis we would like the three single-species prey compositions first on the left, then the three two-species prey compositions, then the three-species prey composition on the right.
Before we do this, look at the prey_composition
variable in its current state:
%>%
stab select(prey_composition)
## # A tibble: 806 x 1
## prey_composition
## <chr>
## 1 Colpidium
## 2 Colpidium
## 3 Colpidium
## 4 Colpidium
## 5 Collodictyon
## # … with 801 more rows
It is a character variable (<chr>
). If we try to see what the levels of this variable are, we get NULL
; there are no levels:
%>%
stab pull(prey_composition) %>%
levels()
## NULL
To change this to an ordered factor type variable, we use the fct_relevel
function from the forcats package inside a mutate
change the prey_composition
variable:
<- stab %>%
stab mutate(prey_composition =
fct_relevel(prey_composition,
"Collodictyon",
"Colpidium",
"Paramecium",
"Colpidium, Collodictyon",
"Colpidium, Paramecium",
"Collodictyon, Paramecium",
"Colpidium, Collodictyon, Paramecium")
)
Now let’s see how the prey_composition
variable looks:
%>%
stab pull(prey_composition) %>%
levels()
## [1] "Collodictyon"
## [2] "Colpidium"
## [3] "Paramecium"
## [4] "Colpidium, Collodictyon"
## [5] "Colpidium, Paramecium"
## [6] "Collodictyon, Paramecium"
## [7] "Colpidium, Collodictyon, Paramecium"
Excellent. The variable is now a factor (<fct>
) and we can see its levels.
Do not worry if it’s not immediately apparent why we made this variable an ordered factor—we emphasise the point a couple of times below where the benefits of this are very clear.
6.9.3 Fix those variable names
We don’t have any spaces, brackets, or other special characters in variable names, and there is no date variable we need to correctly format. Great – the person who entered the data did something right!
6.9.4 Calculate an important variable
When we first introduced the data, we identified four variables that would be used to calculate the number of individual organisms per ml. We must do this because different volumes and dilutions of liquid were counted at different times in the experiment. Standardising the abundance measures to number of individuals per ml will also allow interoperability with data from other studies that have also been standardised.
Look at the definitions of the four variables count
, w1
, w2
, and w3
and see if you can figure out how to convert the count to per ml. It’s not easy, but not too difficult perhaps. Really, give it a go. The challenge of getting insights from data is in large part about being able to solve problems like these. After solving them on paper we can then move to R to implement the solution.
Start by simplifying the problem. Imagine we made no dilution of w1
such that w2 = w1
and counted all of w1
such that w3 = w1
. In effect we just took the volume w1
from the experimental culture and counted the number of individuals in it. If w1
is 1 ml, then then the number of individuals counted (in variable count
) is per ml. If w1
is 0.5ml, we need to multiple the count by 2 (i.e. divide by 0.5) to get to per ml. If w1
is 2 ml we need to divide by 2 to get to per ml. In this case we could use the formula count / w1
to convert to number of individuals per ml.
Now think about if we sampled 1 ml (w1
= 1 ml), added 1 ml of diluent (w2
= 2 ml), and then counted 1 ml of that diluted sample (w3
= 1). We have diluted by a factor of 2 (w2 / w1
= 2) and counted 1 ml of that dilution, so we need to multiply the count by 2 to get to per ml. That is, we multiply the count by the dilution factor to get the count per ml (so long as w3
= 1 ml).
Finally, think about if we count only 0.1 ml of the diluted sample (w3
= 0.1 ml). That is a tenth of a ml, so we need to multiply by ten (or divide by 0.1). That is, we divide by w3
to correct to count per ml when w3
is not 1 ml.
Putting these parts together we see that we multiply by the dilution factor w2/w1
and then divide by the volume counted (w3
). We, of course, add the new dens_per_ml
variable and calculation using the mutate
function:
<- stab %>%
stab mutate(dens_per_ml = count * w2 / w1 / w3)
Super-simple! Now we have a standardised, comparable, and interoperable measurement of population size of each species.
Actually, the data entry was done so to make this calculation work very easily. In particular, w2
is very purposefully the sum of the sampled volume and the diluent, and is not just the volume of the diluent. Furthermore, if there was no dilution, and the count was made of w1
we might have entered NA
for w2
and w3
since they did not exist. Instead, the both w2
and w3
are set to 1 in this case, which means they have no effect in the calculation (multiplication and division by 1 has no effect). This is an example of designing and planning before and during data entry with downstream calculations in mind.
6.9.5 Remove NAs
When we imported the data we noticed some values of count
were missing (NA
). The reason these particular counts are NA while there are still values of w1
, w2
, and w3
is a bit of a puzzle. One might imagine that the NA
should really be a zero. In fact, what the researcher did was to sometimes enter NA
after a species had become apparently extinct, because after not seeing individuals of a species for several days there was no attempt to observe that species, even though a sample was taken to observe any other still extant species in that community. So these NA
values are when a species has gone extinct, and so could be zero, but since no count was attempted, it was appropriate to enter missing value NA
. A consequence of all that is that we can remove all observations/rows containing an NA
.
We can do this with a filter
so that we remove only rows with an NA
in the count
variable:
<- stab %>%
temp1 filter(!is.na(count))
nrow(temp1)
## [1] 680
Or remove we can remove rows with and NA
in any of the variables:
<- stab %>%
temp2 na.omit()
nrow(temp2)
## [1] 680
Since the number of remaining rows is the same for both approaches we know there we no observations in which a variable other than count
contained an NA
when the count
variable did not. So we’re safe doing it either way. Let’s overwrite stab
with a new version with no NA
values.
<- stab %>%
stab na.omit()
6.9.6 Checking some specifics
We should check some feature of the dataset specific to the study.
Which days were samples taken on? To find this we pull
the day
variable from the tibble stab
and get its unique
values:
%>%
stab pull(day) %>%
unique()
## [1] 2 4 6 8 10 12 14 16 18 20
Samples were taken every two days from day two until a maximum of day 20. So we expect a maximum number of counts of species abundances in a particular replicate to be 10 multiplied by the number of species in a replicate community. I.e. the maximum number of counts in a community with two prey species and the predator (recall that all communities contained the predator) will be 30. Maximum number of counts will be 20 for communities with only one prey species. Note that this is the maximum because we know some values of count were NA
.
Let’s now get the number of counts for each replicate community, and while we’re at it, the last day sampled for each replicate. We do this with a group_by %>% summarise
pipeline, grouping by the jar
and prey_composition
variables. We include the prey_composition
variable in the grouping so that we have it to refer to in the resulting summary dataset, and place it first in the group_by
function so that the summary dataset is arranged by this variable.
<-stab %>%
temp1 group_by(prey_composition, jar) %>%
summarise(num_samples = n(),
max_day = max(day))
## `summarise()` has grouped output by 'prey_composition'. You can override using the `.groups` argument.
temp1
## # A tibble: 35 x 4
## prey_composition jar num_samples max_day
## <fct> <dbl> <int> <dbl>
## 1 Collodictyon 3 11 12
## 2 Collodictyon 4 11 12
## 3 Collodictyon 18 14 14
## 4 Collodictyon 19 14 14
## 5 Collodictyon 20 4 4
## # … with 30 more rows
Notice how the dataset is arranged in the same order as the levels of the prey_composition
variable that we specified when we made this variable an ordered factor type variable. The order is as we planned. If we had not made the prey_composition
variable into an ordered factor, R would have chosen its preferred arrangement/ordering, and often this is not our preference. You can see R’s preferred arrangement if you replace prey_composition
with as.character(prey_composition)
in the group_by
.
Looking at the summary tibble temp1
(remember you can use View(temp1)
or click on temp1
in the Environment tab) we see that each of the prey compositions has five replicates, exactly as expected. And, also as expected, the maximum number of samples is 10 times the number of species, though often this maximum was not realised. We also see that some communities were sampled for longer than others (we assume this is due to sampling ceasing when all species went extinct).
We can have R confirm the number of replicates per prey composition with a quick group_by %>% summarise
pipeline working on temp1
:
%>%
temp1 group_by(prey_composition) %>%
summarise(number_of_replicate = n())
## # A tibble: 7 x 2
## prey_composition number_of_replicate
## <fct> <int>
## 1 Collodictyon 5
## 2 Colpidium 5
## 3 Paramecium 5
## 4 Colpidium, Collodictyon 5
## 5 Colpidium, Paramecium 5
## # … with 2 more rows
We can also get the number of counts made of each species by using the same code with species
added as a grouping variable:
<- stab %>%
temp2 group_by(prey_composition, jar, species) %>%
summarise(num_samples = n(),
max_day = max(day))
## `summarise()` has grouped output by 'prey_composition', 'jar'. You can override using the `.groups` argument.
temp2
## # A tibble: 95 x 5
## prey_composition jar species num_samples max_day
## <fct> <dbl> <chr> <int> <dbl>
## 1 Collodictyon 3 Collodict… 5 10
## 2 Collodictyon 3 Dileptus 6 12
## 3 Collodictyon 4 Collodict… 5 10
## 4 Collodictyon 4 Dileptus 6 12
## 5 Collodictyon 18 Collodict… 7 14
## # … with 90 more rows
Here we see that two of the microcosms had only two samples taken. These are the ones in which the last day sampled was day four (confirm this for yourself if you wish; View(temp2)
is useful for this).
All this might seem like overkill, particularly if we imagine that we had collected and entered the data into the spreadsheet. Surely then we would know all this, and therefore not have to check it? It’s not overkill, however. It’s about making efforts to ensure that we and R are in perfect agreement about the nature and contents of the data. Reaching such agreement is absolutely essential, and often involves resolving differences of opinion by highlighting hidden (and potentially dangerous) assumptions made by us and/or by R about the nature and contents of the data. By taking such efforts we are investing in the foundations of our insights. We need solid foundations for reliable, robust, and accurate insights.
6.9.7 A closer look at the data
We so far have not looked at the species abundances and have not yet calculated the three required response variables. Let’s begin by looking at the species abundances, and do this by plotting the dynamics of the species in each microcosm. Now you will begin to see the power, elegance, and amazingness of ggplot
. We will map the log 10 transformed abundance data onto the y-axis. We will map the day
variable onto the x-axis. We will plot the data as points and lines (so use geom_line
and geom_point
) and we will map the species
variable onto the colour of the points and lines. And we will make a grid of facets using facet_grid
with rows of the grid corresponding to prey composition and columns for the replicates. Translating that into ggplot
terms we get the following code (producing Figure 6.2):
%>%
stab ggplot(aes(x = day, y = log10(dens_per_ml), col = species)) +
geom_point() +
geom_line() +
facet_grid(prey_composition ~ replicate)
Notice that we gave the aesthetic mappings as an argument to the ggplot
function (we have mostly been giving the aesthetic mappings to the geoms). We give the mappings to ggplot here because we have two geoms… and we would like both to use the same mappings. And because we don’t give them any mappings, they are inherited from the mappings giving to ggplot. This "inheritance is explained and illustrated in more detail in the book.
Notice the order of the rows of facets in the graph—it is from top to bottom the order we specified when we made the prey_composition
variable an ordered factor. So ggplot
is using this order. If you like to see what the order would have been if we hadn’t specified it, replace prey_composition
with as.character(prey_composition)
.
The dynamics in Figure 6.2 are those shown in figure 1 of the original publication. Everything seems reasonable, though I (Owen) wish I’d recorded more information about why I stopped sampling the microcosms when I did.
The predator (Dileptus) does not seem to grow well when Collodictyon is the only available prey species, but grows very well on and decimates Colpidium, after which the predator appears to go extinct and then the Colpidium population recovers. We can also notice that the predator population persists to the end of the experiment on day 20 only when all three prey species are present. It is probably worth emphasising that 20 days corresponds to around 20 generations of these organisms; this is a lot of time for birth and death to affect population abundances (if each individual divided into two daughters each day, the population would increase in size by a factor of 2^20, i.e. grow to be 1 million times larger).
6.9.8 Calculate the three response variables
There are three response variables corresponding to the three hypotheses about why increased prey diversity might increase predator population stability:
- Predator population abundance.
- Predator population variability.
- Predator population persistence time.
6.9.8.1 Calculating predator (Dileptus) abundance
You might be thinking “we already have predator abundance,” and you’d be correct. We have estimates of the number per ml on each of the sample days. The issue that we now have to deal with, and solve somehow, is that of independence of observations. When we take repeated measures of the same thing, here the same predator population in the same experimental community, the repeated measures are not independent, and measures closer together in time are less independent than are those further apart in time. There are some rather complex and perhaps sophisticated ways of accounting for the non-independence in repeated measures, but they are beyond the scope of what we want to cover. Instead we are going to use a rather crude, but also quite valid solution. That is to use the maximum predator abundance in each of the experimental unit (communities). There are 35 experimental communities, so we expect to have 35 estimates of average predator abundance.
To make the calculation in R we will use a filter %>% group_by %>% summarise
pipeline. We filter to keep only the observations of the predator (Dileptus), then group the data by prey_richness
, prey_composition
, and replicate
, then calculate two summary values. The first summary value, named by us pred_max_dens
is the maximum density of the predator population observed in each experimental community. The second variable calculated, named by us all_zero
is TRUE if all the predator densities were zero, or is otherwise FALSE. Here is the R code:
<- stab %>%
pred_max_dens filter(species == "Dileptus") %>%
group_by(prey_richness, prey_composition, replicate) %>%
summarise(max_dens = max(dens_per_ml),
all_zero = sum(dens_per_ml > 0) == 0)
## `summarise()` has grouped output by 'prey_richness', 'prey_composition'. You can override using the `.groups` argument.
As expected, the resulting tibble contains 35 observations and four variables. Lovely. We will look at the values after we calculate the other two response variables.
In the group_by
we included a redundant grouping variable prey_richness
. It is redundant in the sense that we would have got the same resulting tibble if we had not included it, the only difference being that the resulting tibble would not have contained the prey_richness
variable. We will often find ourselves including redundant variables in a group_by
because we would like to them to appear in the resulting tibble so we can use them later.
We calculated the maximum of the predator abundance, but could have chosen the mean, median, or the geometric mean, or harmonic mean. It is really important to recognise that the choice of mean here is quite arbitrary, and so introduces an element of subjectivity into our journey from data to insights. Such subjectivity is a rather common feature of this journey.
We must make sure our insights are robust to subjective decisions about how to reach them. We can do this by looking at whether the insights are sensitive to changes in these decisions. This is quite different, however, from selecting the decision that gives a result we might desire, which would clearly be a totally inappropriate practice.
6.9.8.2 Calculating predator (Dileptus) population variability
We will use the coefficient of variation as the measure of predator population variable. The coefficient of variation, or CV for short, is the standard deviation of the abundances divided by the mean of the abundances. What values might we expect? Well, one thing to expect is some undefined values, since we know that in six of the communities there were no predators observed. The mean will therefore be zero, and the result of division by zero is undefined. So we expect the CV to be NaN
for the six communities in which the predator was never observed.
The R code is very similar to that used in the previous section to calculate maximum predator abundance:
<- stab %>%
pred_CV filter(species == "Dileptus") %>%
group_by(prey_richness, prey_composition, replicate) %>%
summarise(CV = sd(dens_per_ml) / mean(dens_per_ml),
all_zero = sum(dens_per_ml > 0) == 0)
## `summarise()` has grouped output by 'prey_richness', 'prey_composition'. You can override using the `.groups` argument.
Looking at the data (e.g. with View(pred_CV
) shows us that indeed there are six NaN
values of CV. We will look at the other CV values after we calculate the third response variable.
Congratulations if you noticed that we could have calculated maximum abundance and CV of abundance in the same summarise. This would have been more efficient, in the sense that we would have had less code. It would also have been safer since then we could not have accidentally done something different (e.g. a different grouping) for the two.
6.9.8.3 Calculating predator (Dileptus) persistence time
Before we do, what type of variable is this? One characteristic is that it is “time to event” data… we are observing the amount of time taken for a particular event (here extinction) to occur. The values can’t be negative. They could be continuous, but in this case are not, because we sampled only every other day. Hence the variable is discrete numbers. We will go into a bit more detail about this nature of values of persistence time after we calculate them.
We get the last day on which Dileptus was observed with a filter %>% group_by %>% summarise
pipeline. We filter to retain only observations of Dileptus being present (i.e. we exclude observations when Dileptus had zero count/density per ml). Then we group the data by prey richness, prey composition, and replicate (prey richness is redundant but we want to keep it to work with later). Then we use summarise
to get the maximum value of the day
variable in the filtered data. Translated into R this looks like:
<- stab %>%
temp_persistence_time filter(species == "Dileptus", dens_per_ml != 0) %>%
group_by(prey_richness, prey_composition, replicate) %>%
summarise(last_day_observed = max(day))
## `summarise()` has grouped output by 'prey_richness', 'prey_composition'. You can override using the `.groups` argument.
Note that the created tibble has only 29 observations, whereas there were 35 experimental communities. This difference occurs because Dileptus was never observed in six of the experimental communities and so these are filtered out of this analysis. We will add them in below as a side effect of another manipulation, so don’t need to take care of this further here.
Next we need to recognise that the last day on which Dileptus was observed might have been the last day observed because it was after that observed to be extinct, or that it was never sampled again (e.g. if the last day observed was day 20). If we stop an experiment before we observe the event being watched for (e.g. before we observe extinction) then we say that the observation is censored.
We really should include this information in the persistence time dataset. To do so we need to get and add the day of last sample of Dileptus. We do this with the same pipeline as we used to get the last day observed, except we don’t exclude the observations when Dileptus was sampled and found to have zero abundance:
<- stab %>%
pred_last_day filter(species=="Dileptus") %>%
group_by(prey_richness, prey_composition, replicate) %>%
summarise(last_sample=max(day))
## `summarise()` has grouped output by 'prey_richness', 'prey_composition'. You can override using the `.groups` argument.
The created tibble has the expected number of observations: 35.
We then join the two created tibbles and create a new variable that contains TRUE
if the observation of persistence time was censored (indicated if the persistence time is the same as the last day sampled) or FALSE
if a sample was taken after the observed persistence time (i.e. the observation is not censored).
<- temp_persistence_time %>%
pred_persistence_time full_join(pred_last_day) %>%
mutate(censored=ifelse(last_day_observed==last_sample, T, F))
## Joining, by = c("prey_richness", "prey_composition", "replicate")
Because we used full_join
we have 35 observations in the resulting tibble, with six NA
values of persistence time (and also therefore censored
) corresponding to the six communities that are missing from the persistence_time
tibble because Dileptus was never observed in those communities.
Censored observations in time to event data Often we do not have the exact time at which the event of interest occurred. In this experiment we sampled only one every two days. If we observed some individuals on day 14, for example, and then none on or after day 16, we know that extinction occurred somewhere in the time interval between day 14 and day 16. Thus we have what is called “interval censored” time to event data. That is, unless extinction was not observed by the last sample, in which case the data are right censored, and we can only say that extinction (since it is at some point inevitable) will happen sometime after the day of the last sample. Censoring of time to event data is very common and is formally accounted for in statistical analyses of time to event type data. It certainly complicates and adds potential for misinterpreting visualisation of the data, so we must then at least differentiate data points by the type of censoring they have.
6.9.8.4 Put them all together
Let’s put all three response variables into the same tibble, since it will make some of the subsequent things easier and more efficient. We want them all in the same column. “But wait,” you say! If all three response variables are in the same column/variable, how will we know which is which. Easy—we make another variable that tells us the response variable in each row. The new tibble will have 35 * 3 = 105 rows. There are lots of ways to do this. The approach we show is to merge/join the three tibbles containing each of the response variables (pred_max_dens
, pred_CV
, and pred_persistence_time
) which creates a wide-formatted arrangement of the data (a different variable for each of the response variables). And then we’ll transform it into tidy (long) format by gather the information from each of the three separate response variables into one variable.
The first step is to merge all three tibbles:
<- pred_max_dens %>%
wider_rv full_join(pred_CV) %>%
full_join(pred_persistence_time)
## Joining, by = c("prey_richness", "prey_composition", "replicate", "all_zero")
## Joining, by = c("prey_richness", "prey_composition", "replicate")
We get told both joins are being done by prey_richness
, prey_composition
, and replicate
and also all_zero
for the first of the two joins. Lovely—thanks very much. (The join is done by these because R looks for variables shared by the two tibbles being joined, and by default joins by these.)
Next step is to gather the three response variables into one variable:
<- wider_rv %>%
rv pivot_longer(names_to = "Response_type",
values_to = "Value",
cols = c(max_dens, CV, last_day_observed))
We tell pivot_longer
that we’d like the values in the variables max_dens
, CV
, and last_day_observed
gathered together an put into one variable called Values
and that we’d like the key
variable, i.e. the one that says which response variable each observation is about, to be named Response_type
. If how pivot_longer
works is unclear then please refer to the section of the book that covers this.
Excellent. We now have a tibble rv
with 105 observations (as expected), a variable named Response_type
with entries max_dens
, CV
, and last_day_observed
, and the values of each of these in the Value
variable. We have the other variables all_zero
, last_sample
, and censored
also (which is just fine).
6.10 Shapes
Let’s get the mean, standard deviation, and number of observations of each of the response variables for each prey composition. You have seen how to do this before. Take a look back through the book/your notes/your brain to remind yourself.
Yes… it’s a pretty straightforward group_by %>% summarise
pipeline:
<- rv %>%
summry_rv group_by(prey_richness, prey_composition, Response_type) %>%
summarise(mean = mean(Value, na.rm = TRUE),
sd = sd(Value, na.rm = TRUE),
n = sum(!is.na(Value)))
## `summarise()` has grouped output by 'prey_richness', 'prey_composition'. You can override using the `.groups` argument.
The resulting tibble summry_rv
has 21 observations (7 prey compositions by 3 response variables) and six variables, three of which are the calculated mean, standard deviation, and number of non-NA values.
We counted the number of non-NA values by asking which of the entries in Value
were not NA using !is.na
, which gives a series of TRUE and FALSE values. Then we use sum
to count the number of TRUEs.
Take a look at the resulting tibble (View(summry_rv)
). Can you see any patterns? Do you get clear insights? Probably not. The lesson here is tables are quite difficult to interpret. We need some visualisations of relationships between the response variables differ and the treatments. Before we do this, let’s have a look at the distributions of the response variables. Try to think about how you would produce graphs of the frequency distributions of each of the response variables.
Here is how we did it (producing Figure 6.3):
%>%
rv ggplot() +
geom_histogram(aes(x = Value), bins = 5) +
facet_wrap(~ Response_type, scales = "free")
## Warning: Removed 12 rows containing non-finite values
## (stat_bin).
First note that we are told that 12 non-finite values have been removed and therefore are not in the visualisation. These are the CV
and last_day_observed
for the six communities were no predators were ever observed.
What do you think about the shapes of these distributions? Maximum density looks quite skewed, with lots of small values and few large. This might be expected for an abundance type variable, since growth is a multiplicative process. We would expect a more symmetric distribution of logged abundance (though see Figure 6.4).
## Warning: Removed 6 rows containing non-finite values
## (stat_bin).
There are two reasons why we need to be cautious, however, about interpreting much about these distributions. First, they each contain data from seven different treatments (prey_composition
)—only five of the values in each distribution might be expected to come from the same population. In effect, we might be looking at seven distributions superimposed on top of each other. Second, sure, we could plot a separate distribution for each of the prey compositions, but we would only have five values per distribution, and that is not enough to say much about their shapes. I.e. we could not say with any confidence if they are symmetric of not. All is not lost, however, as you will see in the next section in which we look at relationships.
6.11 Relationships
Recall the question we’re interested in. Do predators have more stable populations when they have more prey to feed on. We expected this to be so if maximum predator population was greater, if predator population variability was lower, and if predator population persistence was longer when there were more prey available. For each of these we must look at the relationship between the response variable and prey richness (the explanatory variable). We can also look at prey composition as an explanatory variable, to see if there is much difference between effects of prey communities with same number of prey species, but different compositions of prey species.
6.11.1 Maximum predator density
Recall that we have one tibble rv
with all three response variables. Clearly then, if we want to plot data about only maximum prey density we will need to filter the data according to Response_type == "max_dens
. We then do a pretty straightforward ggplot
with prey_richness
mapped onto the x-axis, the Value
of maximum density mapped onto the y-axis, and then a geom_jitter
with jitter only along the x-axis (width
). Here’s the code, and the result is in Figure 6.5:
There is a big problem with this. Our figure does not tell us what is on the y-axis (“Value” is not particularly informative!) We could fix this by changing the y-axis like this (no figure is presented for this code):
%>%
rv filter(Response_type == "max_dens") %>%
ggplot() +
geom_jitter(aes(x = prey_richness, y = Value, col = prey_composition),
width = 0.1, height = 0) +
ylab("max_dens")
This is a bit dangerous, in fact, because we have to make sure we have max_dens
written in two places. If we accidentally to write max_dens
in the filter and CV
in the ylab
we will get no warning, no error, and have no idea we did anything wrong. We should spot the error if we carefully compare the graphs among the three different response variables, as there should be some inconsistencies (e.g. exactly the same data plotted in two, or two with different data but the same ylab
).
Let’s do the graph so we making this error is impossible. First we define an object (we name it rv_oi
which is an abbreviation of “response variable of interest”) and assign to it the name of the response variable we want to graph (max_dens
in this case):
<- "max_dens" rv_oi
Then we use the same ggplot
code as before, but with rv_oi
in place of "max_dens"
(the resulting graph is in Figure 6.6):
%>%
rv filter(Response_type == rv_oi) %>%
ggplot() +
geom_jitter(aes(x = prey_richness, y = Value, col = prey_composition),
width = 0.1, height = 0) +
ylab(rv_oi)
Super. We have a safe approach and a useful graph. Do you think there is strong evidence of higher maximum predator abundance at higher prey richness? Neither do we. Strike one.
Plot the data points Note that we are plotting the data points, and not bars with error bars, or even box-and-whisker plots. There are few enough data points in this study that we experience no problems when we plot the data itself, rather than some summary of the data (e.g. the mean plus and minus the standard deviation). As a rule, always plot the data.
If we feel that we can’t so easily see the differences among prey compositions within the prey richness levels, we can make Figure 6.7, and flip the x and y-axis since the prey composition names then fit better:
Some things are a bit clearer. 1) The predator seems to like the prey Colpidium when alone more than either of the other two when alone. 2) Adding Paramecium into a mix with Colpidium gets rid of the beneficial nature of Colpidium (i.e. these two act non-additively). 3) There is huge variation in predator abundance when feeding on a mixture of Colpdium and Collodictyon. Overall, there are strong effects of prey composition, and they sometimes involve non-additive effects, but little evidence of increases in predator abundance with increases in prey richness. Let’s move on to the next response variable.
6.11.2 Predator population variability (CV)
Making the graphs for CV is now a doddle. We just set rv_oi = "CV
and then run exactly the same code as before, resulting in Figure 6.8:
<- "CV"
rv_oi %>% filter(Response_type == rv_oi) %>%
rv ggplot() +
geom_jitter(aes(x = prey_richness, y = Value, col = prey_composition),
width = 0.1, height = 0) +
ylab(rv_oi)
## Warning: Removed 6 rows containing missing values
## (geom_point).
And Figure 6.9:
## Warning: Removed 6 rows containing missing values
## (geom_point).
What do you think? Would you be confident stating that there was a negative relationship here, such that increases in prey richness cause decreases in predator population variability? Put another way, how much would you bet that in reality there is an effect? One problem here is that we have rather few data points, and quite a bit of variability even within prey compositions. Indeed, the amount of variability in the CV of the replicates of the Collodictyon prey composition is greater than all of the other data! In large part this is due to one very low CV Collodictyon replicate.
Looking back at Figure 6.2, can you guess which of the Collodictyon replicates resulted in this very low CV. It is replicate 5, the one with only two samples. But this raises a couple of important points:
How could be label data points in Figure 6.9 and facets in Figure 6.2 with the jar identifier, so we can directly see the corresponding data?
What if anything should we do about data points that seem rather extreme?
Let’s deal with number 1 first. We can put on the jar number as a labels on the facets of 6.2 and beside the data points in Figure 6.9. Here’s the code for putting the jar numbers onto the facets. First we get a tibble with the jar
number for each prey_composition
and replicate
, by selecting only those three columns and getting unique rows:
<- stab %>%
jar_text select(prey_composition, replicate, jar) %>%
unique()
The resulting tibble has 35 rows, one for each jar (microcosm). Then we use the same code as before but with an extra geom_text
in which we give the data to use jar_text
, tell it the x and y positions for the label, and then give the aesthetic mappings label = jar
(i.e. make the label
aesthetic correspond to jar
) and the aesthetic mapping col = NULL
(so that the previously defined col = species
mapping is not used). Here’s the code, and the resulting graph is in Figure 6.10:
%>%
stab ggplot(aes(x = day, y = log10(dens_per_ml), col = species)) +
geom_point() +
geom_line() +
facet_grid(prey_composition ~ replicate) +
geom_text(data = jar_text, x = 18, y = 3.5, aes(label = jar, col = NULL))
Next we add points to the graph of the CV (Figure 6.9). First we need to remake the tibble containing the CVs, this time keeping the jar
variable by adding it as a (albeit redundant) variable in the group_by
:
<- stab %>%
pred_CV filter(species == "Dileptus") %>%
group_by(jar, prey_richness, prey_composition, replicate) %>%
summarise(CV = sd(dens_per_ml) / mean(dens_per_ml),
all_zero = sum(dens_per_ml > 0) == 0)
## `summarise()` has grouped output by 'jar', 'prey_richness', 'prey_composition'. You can override using the `.groups` argument.
Install the ggrepel add-on package. It moves labels on graphs so they don’t lie on top of each other, i.e. they repel each other.
Now we remake the CV figure with the jar numbers labeling the data points. We simply add a geom_text_repel
with the jar
numbers mapped onto the label
aesthetic. Here’s the code, and the resulting graph is in Figure 6.11.
library(ggrepel)
%>%
pred_CV ggplot(aes(x = prey_composition, y = CV)) +
geom_jitter(width = 0.1, height = 0) +
coord_flip() +
geom_text_repel(aes(label = jar))
## Warning: Removed 6 rows containing missing values
## (geom_point).
## Warning: Removed 6 rows containing missing values
## (geom_text_repel).
Super. It’s extremely useful to be able to add labels to data point in graphs, so we can see which replicate, individual, site, etc., the data point corresponds to. This is especially useful when we see what appear to be rather extreme values, which bring us to the next general point—what can and should we do when we see what might be extreme values? Here is a checklist:
Use your best judgment about if a value seems extreme. If you’re in doubt, consider it extreme. Nothing we subsequently do with these apparently extreme values will cause problems if we consider some values as being extreme that other people might not.
Check that each extreme value is not the result of an error. E.g. in data entry, or subsequent assumptions in calculations. If you ever meet us in person, ask us to tell you the badger outlier story.
Plot your graphs with and without the extreme values and see if this leads to you changing your opinion about the presence, absence, or nature of any patterns. If it seems that including or excluding them is important, well, that’s your answer. Your conclusions are not so robust.
Have a look at the influence and outliers section of this web site.
While we’re on the subject of subjectivity and perception, we should think about whether flipping the axis does more than just make room for the prey composition labels. Does having the graph flipped change our perception of pattern (or lack thereof) in the data? Figure 6.12A and B. We (personally) tend to see more evidence of a negative relationship in panel B (with prey richness on the x-axis) but you might be different. We need to be careful to not exploit, or fall prey to, any such effects in the presentation of data. The data should, as far as possible, do all the talking.
## Warning: Removed 6 rows containing missing values
## (geom_point).
## Warning: Removed 6 rows containing missing values
## (geom_point).
6.11.3 Predator persistence time
Visualising predator persistence time (the last day on which a predator was observed) uses the same code as the previous two response variables, but with one tweek. We should differentiate in the visualisation data points that are and are not censored. One method is to make their data points have different shapes. That is, we map the shape aesthetic to the censored
variable (the resulting graph is in Figure 6.13):
## Warning: Removed 6 rows containing missing values
## (geom_point).
There is pretty strong evidence that more prey species available to a predator increases the time for which the predator population persists. With three prey species available, three of the persistence times were censored, and two of these were censored after 20 days—the longest of any population observed. There isn’t such a clear difference between persistence times between one and two prey communities.
We also show, as before, the same data for each prey composition, again showing which observations of persistence time are censored (the resulting graph is in Figure 6.14):
Here we can see that one of the two prey species compositions (Collodictyon and Paramecium) results in predator persistence times as long as with three prey species. And that one of the one prey species composition (Colpidium) results in persistence times as long as some of the two prey species compositions. So there are quite strong compositional effects as well as the richness effect mentioned above.
Look at the figure below, showing what happens if prey_composition
is just a character variable and not an ordered factor. The ordering of the prey compositions is not quite as desirable.
6.11.4 All three at once
You might be wondering why we went to the lengths of putting all three response variables into one tibble and the having to filter in order to plot each of the three graphs. This was a bit of a faff. We could have just continued using the three separate tibbles we made. Here comes the answer, or at least here comes one answer. If we put all three response variables into the same tibble the we can plot all three graphs at the same time, in one ggplot with three facets (Figure 6.16):
%>%
rv ggplot() +
geom_jitter(aes(x = prey_richness,
y = Value,
col = prey_composition),
width = 0.1, height = 0) +
facet_wrap(~ Response_type, scales = "free_y") +
theme(legend.position = "top") +
guides(colour = guide_legend(title.position = "top",
nrow = 3,
position = "top"))
## Warning: Removed 12 rows containing missing values
## (geom_point).
How useful is it to be able to plot all response variables in one go like this? We’re not too sure. We now have the problem of how to show the censored data points in the last_day_observed
data. We also see that the order of the facets is different from the order of the response variables presented in the text above. This would need changing for consistency (you can probably guess how we would do this… the same thing we did with the prey_composition
variable). We’ll leave this for you to do. And also to make the version of this graph with prey_composition
as the explanatory variable.
6.12 Wrapping up
The published paper on this dataset had a few other analyses, but not too many. So what you learned above is very much a replica of a real and quite complete representation of the results and insights in the that study. Yes, we didn’t do the statistics here, but our conclusions are pretty much the same (though this could easily be in part influenced by Owen knowing the conclusions before he wrote this demonstration).
One important thing. It would be very very wrong for you to get the impression that one can think through all the likely issues with the data and in a lovely linear process work from the data to the Insights. Probably this impression comes from this chapter and those of the first case study (though we made the same point there). For example, we do things early on in the data manipulation in order to make things easier later on. It looks like we anticipate a problem that will arise and prevent it ever occurring.
The truth is, however, that we often found problems once we had got quite far into the data processing, and then realised we should go back and fix their cause. E.g. we found a problem while plotting the data, realised this was caused by an issue that could be sorted out much earlier on, and therefore have positive effects on all subsequent analyses. We did not just do a quick fix on the problematic graph… we fixed the source of the problem. This is a very very good habit to get into. Fix the cause, don’t just treat the symptom. Fixing the cause will increase efficiency, reliability, and all round happiness.