# Chapter 14 Answers and solutions

## 14.1 Preface

**Q0.1** Who is the intended audience/users/readers of *Insights*?

Answer: **A**, **C**, and **D** are correct. *Insights* is intended for
beginner scientists in the life and environmental sciences who will be
working with quantitative data. Our other book, *Getting Started with R*,
is for folks who want an introduction to R and who already know a bit
about data analysis (just not in R). **B**: If you already have good
knowledge of data analysis and statistics and need an introduction to R,
you might like to look at our other book, *Getting Started with R, 2nd
Edition*.

**Q0.2** *Insights* teaches a newer (i.e. tidyverse) approach to using
R, and not what might be called the “classic” or “traditional” approach.
What does this mean?

Answer: **A**, **B**, **D**, and **E** are correct. We teach the newer
(tidyverse) approach in our own Undergraduate-Level Introduction to Data
Analysis courses. It works really well. The newer (tidyverse) approach
is not advanced R, it is simple, intuitive, and powerful R. (*Insights*
contains not a single square bracket or dollar sign.)

**Q0.3** *Insights* is a book about the process of getting insights from
data, and yet it contains no statistical analyses (e.g. no regression,
ANOVA, linear models, or generalised linear models). Why are such things
not included?

Answer: Every answer is a good reason to start learning about data analysis without considering statistics.

**Q0.4** What proportion of a data analyst’s time and effort is spent on
tasks such as preparation, import, cleaning, tidying, checking,
double-checking, manipulating, summarising and visualising data?

Answer: We write “about 80%.” Though really we just mean *a lot*. Care
and attention to this work is essential to provide a foundation dataset
from which to derive insights.

**Q0.5** From where do you get the datasets that you need to work along
with the demonstrations in the book?

Answer: Only **D** is correct. You get the data from the online data
repositories where the authors of the original studies deposited their
data. Unlike the majority of “Introduction to Data Analysis” books,
*Insights* starts from the often dirty and messy state data usually begins
in (though the datasets in the Workflow Demonstrations are mostly quite
clean!). It’s a lot of work, but absolutely essential work to do safely and
reliably, to get the data tidy, clean, and arranged so that
visualisation and summarisation (and in the end, statistical tests) are
straightforward and efficient, and that the insights derived from them
are accurate, reliable, and robust.

**Q0.6** Which one of the book’s authors works at the Natural History
Museum in London?

Answer: Natalie.

**Q0.7** Which one used to brew beer (from grain)?

Answer: Owen.

**Q0.8** Which one has strangely hairy ears?

Answer: Andrew.

**Q0.9** Which one has cuddled sheep on a remote Scottish Island?

Answer: Dylan.

## 14.2 Chapter 1 (Introduction)

**Q1.1** Can we get insights from data without using complex statistical
models and analyses, without machine learning, without being a master
data scientist?

Answer: Yes, it is possible to get insights from data without all these things. And we think doing so is a great starting point to learn fundamental skills for getting insights from data. Getting insights from more complex and larger datasets than used in the book can, however, be greatly assisted by statistical models, machine learning, and many other more advanced methods.

**Q1.2** What is an advantage of focusing on the data without any
statistical models/tests to think about?

Answer: We will focus on important characteristics of the data and the patterns in it. We are more likely to think about the strength of and practical importance of patterns in the data. We are less likely to focus on statistical significance at the expense of everything else.

**Q1.3** With what should we start our journey from data to insights?

Answer: Out of the four options given, a question is the first we should think about. The clearer and more specific the better. Making a clear and specific question can be assisted by sketching a graph. We must be careful to minimise the possibility for our insights to be affected by what we would like to find.

**Q1.4** Why is it important to know if a study resulting in a dataset
was a randomised manipulative experiment?

Answer: With a randomised manipulative experiment we have a chance of inferring causation… that changing something caused a change in something else. On the down side, logistical constraints mean such experiments occur in rather controlled and less realistic settings than other types of studies.

**Q1.5** Datasets, i.e. convenient arrangements of data, come in many
forms. Which arrangement is used throughout this book?

Answer: We focus on rectangular data, i.e. a table of data with rows and columns. A spreadsheet is an example of such rectangular data. We focus on this as it is a simple, useful, and flexible way of arranging data. Furthermore, there are many convenient and powerful approaches for working with data thus arranged.

**Q1.6** What is a “response variable” and what are other names for one?

Answer: A *response variable* contains the data
(measurements/observations) that we are interested in understanding the
variation in. This is the variable that is “responding” to other
variables. It is also known as the *dependent* variable. Typically we put
the response variable on the y-axis of a graph (i.e. we *map* variation
in the response variable to variation along the y-axis).

**Q1.7** What is an “explanatory variable” and what are other names for
one?

Answer: An explanatory variable is a variable that contains measurements
that can *explain* variation in a response variable. They are also
termed the *independent* variable and the *predictor* variable. The
rationale for these names is that an explanatory variable is not
dependent on other variables, and can be used to predict variation in
the response variable.

**Q1.8** Give five important features of datasets, and explain in your
own words why they are important.

Answer: The five given in *Insights* are 1) the number of observations,
2) the number of variables, 3) if variables describe the manipulations
of a randomised experiment, 4) correlation among the variables, and 5)
how independent are the observations. See the text of the book for why
we think these are important. Great if you have thought of other
important features… drop us a line to let us know.

**Q1.9** What do each of the four demonstration datasets used in the
Workflow Demonstrations share in common?

Answer: They all concern food!

**Q1.10** Which step in getting insights from data was missing from the
presented workflow? (This is a bit of a trick question.)

Answer: Perhaps lots were missing, and some of the presented steps could be broken up into multiple steps. It’s always risky to present a general recipe for something as ultimately diverse as getting insights from data. Still, it’s a good idea to start somewhere. The missing step mentioned was communication, by the way. Or at least that is one that we thought of as rather important.

## 14.3 Chapter 2 (Getting Acquainted)

**Q2.1** True or false: R and RStudio are the same software program?

Answer: This is false. They are two separate programs. RStudio is like a helper program that surrounds R. We only use R via RStudio. Because they’re separate programs we must update each separately; updating one will not update the other.

**Q2.2** What will R tell us if we type into the Console `1 + 3 * 3`?

Answer: The answer is 10. R does the multiplication first, as it should.
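We can check this, and see how parentheses change the order of operations, directly in the Console:

```r
1 + 3 * 3    # multiplication happens first, so this is 1 + 9 = 10
(1 + 3) * 3  # parentheses force the addition first: 4 * 3 = 12
```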

**Q2.3** What will R tell us if we type into the Console `log(100)`?

Answer: The answer is 4.60517. This is because the `log` function is the
natural log, often written as “ln.” If you answered 2 you were thinking
of `log10(100)`.

**Q2.4** How would we assign to the name “my_constant” the value of
“log(100)?”

Answer: `my_constant <- log(100)`

**Q2.5** What commands should we be typing directly into the Console?

Answer: Any that we know we don’t want to remember and don’t want to use again. I.e. very few. The vast majority we type in our script, and then “send” to the Console. This practice will result in our script containing a complete record of everything we did.

**Q2.6** What is the error in the code?

Answer: In the second line of code the name `my_x_variable` was
misspelled (the second “a” was missing). Hence we get the error:
`object 'my_x_variable' not found`.

**Q2.7** When we give a function some arguments, when should we name the
arguments?

Answer: Name them whenever you are not sure about what order the function expects the arguments in. When starting with R, it can be comforting to always name the arguments, and then to relax this when one becomes more used to what individual functions expect.
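For example, with base R’s `seq` function (not from the book, just a convenient illustration): naming the arguments means their order no longer matters.

```r
# seq() expects from, to, and by, in that order
seq(1, 10, 2)                   # positional arguments: 1 3 5 7 9
seq(by = 2, to = 10, from = 1)  # named arguments in any order: same result
```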

**Q2.8** True or false: it is good practice to have at the top of our
script a command to *install* each of the packages we would like to use.

Answer: False. Do not do this. We only need to load the add-on package
with, for example, `library(dplyr)`. Installing packages every time you
run your code is a waste of time.

**Q2.9** True or false: When asking help from others it is sufficient to
email them a verbal description of the problem, and a copy of the error
message R gave.

Answer: False. This will rarely be enough for someone to help find the problem. Even sending the line of code that produces the error is often not sufficient. Better to send more code than less, and if possible code that works, at least up until it fails. And send sample data if the code reads in and uses data.

**Q2.10** If RStudio freezes, are we going to lose all our work?

Answer: Probably not. RStudio has very likely autosaved the very latest version of our script. Just in case, attempt to copy the script from RStudio and paste it somewhere safe, before quitting and restarting RStudio.

## 14.4 Chapter 3 (Workflow Demonstration–Part 1)

**Q3.2** What are characteristic features of tidy data?

Answer: In a tidy dataset, the same type of data is not spread across multiple columns, i.e. one variable (one type of information) only occurs in one column. A corollary is that each row contains only one observation. More information about tidy data, and about converting between tidy and not-so-tidy data, is given in chapter 7 of the Insights book.

**Q3.3** Write down two reasons why it’s a good idea to use the Projects
that are available in RStudio.

Answer: Using projects makes our code more sharable, i.e. will work on someone else’s computer without changing anything in the code. Switching between projects is simple.

**Q3.4** What is the name given to the term `%>%`, and what does it do?

Answer: The term `%>%` is known as the *pipe*. It is used to “send” the
outcome of one function into another. You will get to know it well, as
we use it *a lot*; it is described in chapter 6 of the Insights book.

**Q3.5** A little trickier perhaps… which variable in the bat diet
dataset is numeric but should, arguably, not be?

Answer: `Bat_ID` is numeric, but these are identities (i.e. names). They
could just as well be words. Leaving them as numbers could allow us to
do something stupid, like involving them in a calculation. Using numbers
in such situations can also result in `ggplot` not doing what we’d like;
for example, it would lead to a colour gradient rather than discrete
colours if we mapped the numerical variable to the colour aesthetic.

**Q3.6** In what type of object do we often (and in this workflow) store
data in R, and what are some features of this object?

Answer: We store data in R in a *tibble*, which is a special type of
data frame. It has rows each containing an observation, and columns
containing variables.

**Q3.7** We cleaned up the variable names using a string replacement
function (`str_replace_all`). Find the name of the add-on package that
contains the rather nice function `clean_names()`, which can quickly and
easily result in us having nice clean variable names.

Answer: The **janitor** add-on package contains the `clean_names()`
function.

**Q3.8** Take a look at this version of the script of this part of the
Bat Diet workflow
demonstration.
We have added a variety of errors into the script, and your task is to
find the errors and correct them.

Answer: The script without errors is here. Many of the errors were typos, and would have caused an error in R. Remember that some mistakes (e.g. doing multiplication rather than addition) will not cause R to show an error. Avoiding making these (e.g. by writing readable code), and having checks to spot them when they occur, is very important.

## 14.5 Chapter 4 (Workflow Demonstration–Part 2)

**Q4.1** Chapter 3 mostly concerned getting the data ready. Chapter 4 is
mostly about getting information from the data, including answering our
questions. What two add-on packages are used to do this, and what is each
of them useful for?

Answer: We use the **dplyr** package for manipulating data and making
calculations based on the data (e.g. to calculate numbers of observations,
and mean observations), and we use the **ggplot2** package for
visualising the data.

**Q4.2** What do we mean by “sanity-checking insights?”

Answer: “Sanity-checking insights” are insights that confirm basic information about the data is as we expect it to be. The insight is that we are seeing what we expect to see. These are just as important as insights about our questions, as they give us confidence that our work with the data is reliable and accurate.

**Q4.3** How many times will you accidentally type `=` in your code when
you meant to type `==`, and what is the difference?

Answer: If you’re anything like us (Owen at least) you will do this for as long as you’re coding, meaning that it happens a lot. The double equals sign is a logical operator, and asks if the things on either side of it are equal or not. A single equals is a name-value pair, used to associate a value with a name.
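A minimal illustration of the two meanings:

```r
x <- 5
x == 5               # logical comparison: is x equal to 5? TRUE
mean(x = c(1, 2, 3)) # name-value pair: the argument named x gets c(1, 2, 3)
```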

**Q4.4** Imagine we calculate the mean of a variable and the answer, to
our surprise, is `NA`. This is a surprise because we know the variable
is numeric and we were therefore expecting a number. What is a likely
reason for us getting NA, and what is one way of dealing with this?

Answer: It is likely that the variable contains at least one `NA`,
causing the calculation of the mean to fail (in the sense that the
answer is NA). We can ask for the NA values to be ignored when
calculating the mean with the argument `na.rm = TRUE`.
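For example, with a made-up numeric vector containing one missing value:

```r
heights <- c(1.6, 1.8, NA, 1.7)
mean(heights)                # NA, because of the missing value
mean(heights, na.rm = TRUE)  # 1.7, the mean of the non-missing values
```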

**Q4.5** Imagine that we observe that the association between two
variables seems to depend on a third variable. E.g. the effects of
herbivory on plant biomass is negative in the absence of a predator, but
is positive in the presence of a predator. Which of the terms might be
appropriate to describe this kind of pattern?

Answer: When we observe that the association between two variables seems
to depend on the value of a third variable, we are essentially observing
*context dependence* in the association. The association between the two
variables depends on the context, which here is the value of the third
variable. This kind of pattern is also termed an *interaction* between
two variables. They are not acting additively/independently. And
obviously the term *associated* is not sufficient to describe a context
dependent association.

**Q4.6** We made a type of graph that allowed us to see the number of
observations of a variable that had particular values (i.e. the number
of values that fell in a particular *bin*)? What is a name of this type
of graph, and why is it important to look at this?

Answer: This type of graph is called a *histogram* or a *frequency
distribution*. They are very useful for viewing the range of values of a
variable, and which values are most common. This information is
essential for drawing conclusions from the data; for example it can
determine whether the mean or median is a more appropriate measure of
central tendency (i.e. where the centre of the distribution lies).
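A sketch of how such a graph can be made with **ggplot2**, using made-up count data (the variable name `num_prey` is only for illustration):

```r
library(ggplot2)

# Made-up data: 100 counts drawn from a Poisson distribution
toy <- data.frame(num_prey = rpois(100, lambda = 4))

# One bar per bin; binwidth = 1 suits integer counts
ggplot(toy) +
  geom_histogram(aes(x = num_prey), binwidth = 1)
```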

**Q4.7** We counted (calculated) the number of different prey species
detected in the poop of each bat. What is the smallest value this number
could take, and why is this important to realise?

Answer: The lowest number of different prey species detected is zero.
Hence this distribution cannot include values of less than zero… it is
impossible. This is important to realise because it gives us clues about
what types of statistical tests we would do (though we do not do any in
*Insights*). For example, the Poisson distribution is often used in
statistical tests performed on count data.

**Q4.8** In chapter 4 (and at many other places in *Insights*) we leave
the axes labels of the displayed graphs with relatively ugly labels.
E.g. in Figure 4.9 the y-axis label is `num_prey`. Why don’t we make
these and other labels nicer?

Answer: These are graphs meant for us to explore the data, and to assess the weight of evidence for patterns in it. So long as the axes (and other) labels are good enough for us to understand, great. There is no need, at this point, to make the graphs nicer. That can wait until we are more certain about what we will use to communicate our findings.

**Q4.9** Did you follow and understand the second part of this chapter “A
prey-centric view” of the data and questions?

Answer: If you answered no, then no problem! It is pretty complex, with some quite involved series of R operations, and also some new concepts. So keep the task of looking through this section for a later date, when you feel like a bit of a challenge. The most important take-home message from this section is that often there are alternate ways of looking at the same data, each of which can provide additional insight.

**Q4.10** The workflow demonstration presented in chapter 3 and 4 were
quite linear… we did one thing, then built on it, then another
building on that. What did we do in reality, while developing the
workflow demonstration?

Answer: We worked with the data a bit, and then found something that didn’t work so well, and so then went back and made a change to the code so that it worked better. There was quite a lot of this optimisation. Though we were always aware of and focused on where we wanted to get to (i.e. the graphs we were aiming for that would answer our question) we did not from the start see the linear path to that goal. We went down some dead-ends, or at least some rather messy paths, and once we recognised that we had, we went back and changed direction to be more direct, elegant, robust, and reliable. Many of these changes were to prevent problems from ever occurring, as this is preferable to fixing a problem once it has occurred.

## 14.6 Chapter 5 (Dealing with data 1—Digging into dplyr)

### 14.6.1 General questions and exercises

**Q5.1** What **dplyr** function (i.e. function from the **dplyr**
add-on package) do we use to calculate summary information about data?

Answer: `summarise`

**Q5.2** What **dplyr** function do we use to keep a collection of
rows/observations in a dataset, according to the values in one of the
variables?

Answer: The `filter` function is used to keep a set of rows, according
to values in a variable. E.g. keep the rows/observations for male bats:

```
bats %>%
  filter(Sex == "Male")
```

**Q5.3** What is `%in%` used for?

Answer: It is used to ask if a value (e.g., `Owen`) appears in a list of
other values. E.g. try guessing what these will do, and then run them to
check your guess:

```
"Owen" %in% c("Owen", "Natalie", "Dylan", "Andrew")
c("Owen", "Heath") %in% c("Owen", "Natalie", "Dylan", "Andrew")
```

**Q5.4** What **dplyr** function do we use to add a new variable to
a dataset that is a transformation of an existing variable?

Answer: `mutate` – for example, in the book we used `mutate` to replace
the “M” with “Male,” and “F” with “Female.”
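A sketch of that kind of recoding, using a made-up tibble rather than the book’s actual data:

```r
library(dplyr)

toy <- tibble(Sex = c("M", "F", "M"))

# Recode the abbreviations to full words
toy <- toy %>%
  mutate(Sex = case_when(Sex == "M" ~ "Male",
                         Sex == "F" ~ "Female"))
```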

**Q5.5** List four other **dplyr** functions and write about what
they’re used for.

Answer: `arrange`, `select`, `rename`, `group_by`, and `summarise`.
Please see the book for descriptions of what these do (or search
online).

### 14.6.2 Bat diet workflow questions and exercises

**Q5.6** Find the identity (`Bat_ID`) of the two bats that ate only the
largest (52.5mm wingspan) type of prey.

Answer:

```
bats %>%
  filter(Wingspan_mm == 52.5) %>%
  select(Bat_ID)
```

```
## # A tibble: 2 x 1
## Bat_ID
## <dbl>
## 1 367
## 2 1706
```

**Q5.7** How many different prey *species* were detected in total?

Calculate the total number of prey species by counting the number of
unique values in the prey `Species` variable:

Answer:

```
bats %>%
  summarise(total_num_prey_species = length(unique(Species)))
```

```
## # A tibble: 1 x 1
## total_num_prey_species
## <int>
## 1 115
```

**Q5.8** The following code is intended to calculate the number of prey
items found in each poop. Find and correct the three intentional errors
it contains:

```
prey_stats < bats %>%
  group_by(Bat_Id) %>%
  summarise(num_prey = n()
```

Answer: The assignment arrow was missing its `-`, the `d` of `Bat_Id`
should be a capital `D`, and an additional closing bracket was needed at
the end of the third line:

```
prey_stats <- bats %>%
  group_by(Bat_ID) %>%
  summarise(num_prey = n())
```

**Q5.9** Calculate the number of times each prey item was observed.

Answer:

```
num_times <- bats %>%
  group_by(Species) %>%
  summarise(num_times_obs = n())
```

**Q5.10** Calculate the number of migratory and non-migratory prey
species, and pest or non-pest, and each combination of migratory and pest.

Answers:

```
bats %>%
  group_by(Migratory) %>%
  summarise(total_num_prey_species = length(unique(Species)))
```

```
## # A tibble: 2 x 2
## Migratory total_num_prey_species
## <chr> <int>
## 1 no 94
## 2 yes 21
```

```
bats %>%
  group_by(Pest) %>%
  summarise(total_num_prey_species = length(unique(Species)))
```

```
## # A tibble: 2 x 2
## Pest total_num_prey_species
## <chr> <int>
## 1 no 86
## 2 yes 29
```

```
bats %>%
  group_by(Migratory, Pest) %>%
  summarise(total_num_prey_species = length(unique(Species)))
```

```
## `summarise()` has grouped output by 'Migratory'. You can override using the `.groups` argument.
```

```
## # A tibble: 4 x 3
## Migratory Pest total_num_prey_species
## <chr> <chr> <int>
## 1 no no 78
## 2 no yes 16
## 3 yes no 8
## 4 yes yes 13
```

**Q5.11** What was the maximum number of times a prey species could have
been observed?

Answer: This is the number of bat poops observed (143):

```
bats %>%
  pull(Bat_ID) %>%
  unique() %>%
  length()
```

```
## [1] 143
```

**Q5.12** What proportion of prey species were observed in only one poo?

Answer:

```
num_times %>%
  summarise(prop_singletons = sum(num_times_obs == 1) / n())
```

```
## # A tibble: 1 x 1
## prop_singletons
## <dbl>
## 1 0.461
```

**Q5.13** How many bats were caught on each of the dates?

Answer: Get the number of bats caught on each date with a `group_by`
with `Bat_ID` and `Date_proper` as the grouping variables. Pipe the
result into a `summarise` to get the unique `Bat_ID`s, then make another
`group_by` but this time on the `Date_proper` variable only and use the
`n()` function to count the number of bats per date:

```
bats_per_date <- bats %>%
  group_by(Bat_ID, Date_proper) %>%
  summarise(Unique_Bat_ID = unique(Bat_ID)) %>%
  group_by(Date_proper) %>%
  summarise(Number = n())
```

```
## `summarise()` has grouped output by 'Bat_ID'. You can override using the `.groups` argument.
```

Here is a graph of number of bats caught on each date:

```
bats_per_date %>%
  ggplot() +
  geom_point(aes(x = Date_proper, y = Number)) +
  ggtitle("Number of bats caught on each date.")
```

**Q5.14** The Abstract of the paper states that Lepidoptera were mostly
from the Noctuidae and Geometridae families. How many species of
Noctuidae and Geometridae are there in the dataset?

```
# Solution
families <- bats %>%
  select(Species, Order, Family) %>%
  distinct() %>%
  group_by(Order, Family) %>%
  summarize(num_spp = n())

families %>%
  filter(Family %in% c("Noctuidae", "Geometridae"))
```

**Q5.15** The paper states that *56.9±36.7% were migratory moth
species*. Calculate this yourself.

Answer: In this solution we calculate the proportion migratory in the diet of each bat, then calculate the mean and standard deviation of these proportions. The answer we find is not the same as in the paper, though we are not sure why.

```
bats %>%
  group_by(Bat_ID, Migratory) %>%
  summarise(num_prey = n()) %>%
  spread(key = Migratory, value = num_prey) %>%
  mutate(no = ifelse(is.na(no), 0, no),
         yes = ifelse(is.na(yes), 0, yes),
         perc_migratory = yes / (yes + no)) %>%
  ungroup() %>%
  summarise(mean(perc_migratory),
            sd(perc_migratory))
```

**Q5.16** Confirm the results from the paper: *Moths (Lepidoptera;
mainly Noctuidae and Geometridae) were by far the most frequently
recorded prey, occurring in nearly all samples and accounting for 96 out
of 115 prey taxa.*

```
bats %>%
  select(Species, Order) %>%
  distinct() %>%
  group_by(Order) %>%
  summarize(num_spp = n())
```

**Q5.17** Confirm the results from the paper: *Each pellet [poo]
contained on average 4.1 ± 2.2 prey items*

```
bats %>%
  group_by(Bat_ID) %>%
  summarise(num_prey = n()) %>%
  summarise(mean(num_prey),
            sd(num_prey))
```

A slightly different answer to that in the paper. We are not sure why.

## 14.7 Chapter 6 (Dealing with data 2—Expanding your toolkit)

### 14.7.1 General questions and exercises

**Q6.1** Describe in your own words a couple of reasons why piping
(using `%>%`) is useful.

Answer: Here are a couple. You may have others too. Pipes allow us to avoid making intermediate datasets that we have no direct use for. These create clutter, which isn’t a big deal sometimes, but can get a bit distracting. Another reason is that we avoid having to nest functions inside functions.

**Q6.2** Manipulating strings is a very important skill. In your own
words, describe what the following functions do: `str_replace_all`,
`case_when`, `separate`, and `grepl` (all of which are mentioned in the
book).

Answer: Here in our words:

- `str_replace_all` – replaces all occurrences of one pattern in a string with another pattern.
- `case_when` – do a case-by-case replacement of strings with other strings.
- `separate` – split one column into multiple columns, breaking the strings at a given separator.
- `grepl` – report if a string does or does not contain another string.
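A few quick illustrations, using toy strings rather than the bat data:

```r
library(stringr)

str_replace_all("Bat_ID_12", "_", "-")   # "Bat-ID-12"
grepl("idae", c("Noctuidae", "Tipula"))  # TRUE FALSE
```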

**Q6.3** In your own words describe three other string manipulation
functions.

Answer: Our choice: `str_sub` – used to extract parts of a string, e.g.
the 3rd to 6th characters; `str_detect` – report if a pattern occurs in
a string; `str_length` – report the number of characters in a string.

**Q6.4** We can use the function `dmonths(1)` to get the duration of one
month, according to R. In which case we see that the duration is
`2629800s (~4.35 weeks)`. What does this mean about how **lubridate**
calculates this duration of a month?

Answer: 365.25 days / 12 months × 24 hours × 60 minutes × 60 seconds
= 2629800 seconds. So `dmonths(1)` calculates the duration of a month as
the average duration of a month in an average duration year (365.25
days, due to leap years every four years). The implication is that we
must be very careful if we want to use the duration of a specific month
in our calculations.
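We can confirm the arithmetic directly in R:

```r
# Average month length: an average year (365.25 days) divided by 12,
# converted to seconds
365.25 / 12 * 24 * 60 * 60  # 2629800
```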

**Q6.5** Working with dates and times can be rather tricky and painful.
What add-on package do we recommend for working with dates and times,
because it contains many very useful and simple-to-use functions?

Answer: We recommend the add-on package `lubridate` for working with
dates and times.

**Q6.6** What would the function `ymd` be useful for?

Answer: The function `ymd` is used to convert strings that contain year,
then month, then day (e.g. 2020-11-03) into date format.
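For example (assuming the **lubridate** package is installed):

```r
library(lubridate)

d <- ymd("2020-11-03")
class(d)  # "Date"
d + 7     # dates support arithmetic: a week later is "2020-11-10"
```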

**Q6.7** Why is it useful to convert strings containing dates and times
into date and time formatted variables?

Answer: Once converted, we can do maths with these (e.g. calculate the
amount of time between two date/times). Also, when we use `ggplot` to
visualise data with a date-formatted variable, we will get dates
displayed properly and nicely in the graph.

**Q6.8** In your own words write down what the `pivot_longer` function
is used for, and also the three arguments (and their meanings) that we
usually give it.

Answer: `pivot_longer` is used to take information that is spread across
multiple columns and put it into one (long) column. I.e. it is used to
convert data from wide to long format. Long format is often quite a tidy
format, and is therefore efficient for processing with functions in the
**dplyr** and **ggplot2** add-on packages. The three arguments we
usually give it are:

- `names_to` – the name of the new column in the new dataset that will contain the names of the variables in the old dataset.
- `values_to` – the name of the new column in the new dataset that will contain the values in the variables in the old dataset.
- `cols` – the columns in the old dataset that contain the data that should be gathered into one column.
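A minimal sketch with a made-up wide dataset (the column names `yr_2019` and `yr_2020` are invented for the example):

```r
library(tidyr)

# Toy wide data: one column per year
wide <- tibble::tibble(site = c("A", "B"),
                       yr_2019 = c(10, 20),
                       yr_2020 = c(12, 18))

# Gather the two year columns into a year column and a count column
long <- pivot_longer(wide,
                     names_to = "year",
                     values_to = "count",
                     cols = c(yr_2019, yr_2020))
```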

**Q6.9** Look at the online Workflow Demonstration “Food-diversity
Polity” and work through tidying (particularly the use of
`pivot_longer`) the FAO Food Balance Sheet data.

Answer: no answer, just work through it.

### 14.7.2 Bat diet workflow questions and exercises

**Q6.10** Make a pipeline of operations, just like the one in the
Insights book, that will calculate the number of prey species in the
poop of each bat, and the average number of prey per bat for each
taxonomic Order.

Answer:

```
bats %>%
  group_by(Bat_ID, Order) %>%
  summarise(Num_prey = n()) %>%
  group_by(Order) %>%
  summarise(Mean_num_prey = mean(Num_prey))
```

```
## `summarise()` has grouped output by 'Bat_ID'. You can override using the `.groups` argument.
```

```
## # A tibble: 7 x 2
## Order Mean_num_prey
## <chr> <dbl>
## 1 Coleoptera 1
## 2 Diptera 1.12
## 3 Hemiptera 1
## 4 Lepidoptera 4.06
## 5 Neuroptera 1.29
## # … with 2 more rows
```

**Q6.11** Make a variable containing both the age and sex of the bats.
E.g. “Adult-Female,” “Adult-Male,” and so on.

Answer: There are a number of ways to do this. A `mutate` with a `paste`
within it would work. In our answer we use the `unite` function from
the `tidyr` add-on package…

```
bats <- bats %>%
  unite(col = Age_Sex,
        c("Age", "Sex"),
        sep = "-",
        remove = FALSE)
```

**Q6.12** How long did the fieldwork take, from the first sample to the
last?

Answer: We used summarise to subtract the maximum value of the date variable from the minimum…

```
bats %>%
  summarise(duration = max(Date_proper) - min(Date_proper))
```

```
## # A tibble: 1 x 1
## duration
## <drtn>
## 1 453 days
```

If you get an error, please ensure you used the proper date variable…
it must be formatted as a date (e.g. using the `dmy` function from the
**lubridate** package).

**Q6.13** Change the arrangement of the bat diet data to be wide-format.
That is, for each bat make there be only one row, and have a column for
each of the possible prey species, with number of reads at the entry.

Answer: Use the `pivot_wider` function. Be careful to give *all* the
columns that identify the rows (`id_cols`), then just specify which
column contains the names, and which column contains the values…

```
bats_wide <- bats %>%
  pivot_wider(id_cols = c("Bat_ID", "Age", "Sex", "Date_proper"),
              names_from = "Species",
              values_from = "No._Reads")
```

**Q6.14** If you have NA’s in the wide dataset you made for the answer
to the previous question, figure out how to replace these with zeros.
Hint: you do not need to use another function or operation.

Answer: Add the argument `values_fill = 0` to the use of the
`pivot_wider` function:

```
<- bats %>%
bats_wide pivot_wider(id_cols = c("Bat_ID", "Age", "Sex", "Date_proper"),
names_from = "Species",
values_from = "No._Reads",
values_fill = 0)
```
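To see what `values_fill` does on its own, here is a toy example (a made-up mini-table, not the real bat data): bat 2 was never observed eating moths, so without `values_fill = 0` its `moth` cell would be `NA`:

```r
library(tidyr)

# Toy long-format data: one row per bat-by-prey observation
obs <- tibble::tibble(Bat_ID = c(1, 1, 2),
                      Prey   = c("moth", "fly", "fly"),
                      Reads  = c(10, 5, 7))

# With values_fill = 0 the missing bat 2 x moth combination
# becomes a zero count rather than NA.
wide <- pivot_wider(obs,
                    names_from  = "Prey",
                    values_from = "Reads",
                    values_fill = 0)
wide
```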

**Q6.15** How many rows will there be when we change this `bats_wide` dataset back to long format?

Answer: It will be the number of rows in `bats_wide` (143) multiplied by the number of prey species (119 columns minus the 4 id columns = 115). So we should have 16,445 rows.

**Q6.16** And now switch the `bats_wide` dataset back to long format, and then make a new column that is a binary prey presence/absence variable.

Answer:

```
bats_long <- bats_wide %>%
  pivot_longer(names_to = "Prey_species",
               values_to = "Num_reads",
               cols = 5:119) %>%
  mutate(Pres_abs = ifelse(Num_reads == 0, 0, 1))
```

**Q6.17** Confirm that the number of presences is the same as in the
original dataset.

Answer:

```
bats_long %>%
  summarise(total_prey = sum(Pres_abs))
```

```
## # A tibble: 1 x 1
## total_prey
## <dbl>
## 1 633
```

Perfect… 633 is the number of rows in the original dataset (in which only presences were recorded).

## 14.8 Chapter 7 (Getting to grips with ggplot2)

### 14.8.1 General questions and exercises

**Q7.1** What is the difference between **ggplot2** and `ggplot`?

Answer: **ggplot2** is the add-on package. `ggplot` is one of the functions in that package. Hence we would run `library(ggplot2)`, and then use the function `ggplot` to make a particular graph.

**Q7.2** In the context of a graph being made with `ggplot`, give two examples of aesthetic mappings, and give the function we use to specify them.

Answer: Example aesthetic mappings are 1) mapping of the colour of a point to a variable that contains levels low, medium, and high; 2) mapping of a numeric variable to the x-axis position of a point. We use the `aes` function to specify the aesthetic mappings.

**Q7.3** In the context of **ggplot2**, what is a “scale?”

Answer: The scale of a **ggplot2** graph controls how the data are
mapped to the aesthetic attributes, such as an x/y location, or the
colour and size of points in a plot. I.e. with the aesthetic mapping we
specify *what* is mapped to what, and with a scale we specify *how* it
is mapped.

**Q7.4** In the context of **ggplot2**, what is a “layer?”

Answer: A layer is a set of information that we add to the graph. E.g. some points corresponding to some data. Or some lines joining data points. Or some text. Or an image. Think of layers as transparent sheets of plastic that we draw something on, and then lay onto the graph to add information.

**Q7.5** In the context of **ggplot2**, what is a “facet?”

Answer: A facet is a panel for subsets of data, such as a separate panel for each sex or age. We can use the `facet_grid` or `facet_wrap` function to make a graph with multiple panels (facets), each showing a subset of the data.

### 14.8.2 Bat diet workflow questions and exercises

**Q7.6** Plot the shape of the distribution of the number of poops collected on each sampling date.

Answer:

```
bats %>%
  ggplot() +
  geom_histogram(aes(x = Date_proper))
```

```
## `stat_bin()` using `bins = 30`. Pick better value
## with `binwidth`.
```

**Q7.7** Plot the distribution of the number of reads. What shape is it? From looking at the distribution, predict whether the mean or median is larger. Then check your prediction by calculating the median and mean. For extra credit, add two vertical lines to the graph, one where the mean is, and one where the median is (hint: use the `geom_vline` function).

Answer:

```
bats %>%
  ggplot() +
  geom_histogram(aes(x = No._Reads))
```

```
## `stat_bin()` using `bins = 30`. Pick better value
## with `binwidth`.
```

Because the distribution has a few very large values, the mean will be pulled to the right (be larger) relative to the median. Let's check if we're correct:

```
bats %>%
  summarise(mean_num_reads = mean(No._Reads),
            median_num_reads = median(No._Reads))
```

```
## # A tibble: 1 x 2
## mean_num_reads median_num_reads
## <dbl> <dbl>
## 1 7700. 2741
```

Yes, the mean is much much larger than the median.

```
bats %>%
  ggplot() +
  geom_histogram(aes(x = No._Reads)) +
  geom_vline(xintercept = 7700) +
  geom_vline(xintercept = 2741)
```

```
## `stat_bin()` using `bins = 30`. Pick better value
## with `binwidth`.
```


**Q7.8** The following code is intended to create a plot of the wingspan
of prey found in male and female bat poops. Find and correct the three
intentional errors it contains:

```
prey_stats %>%
  ggplot() %>%
  geom_beewarm(mapping = aes(x = Sex y = mean_wingspan))
```

Answer: The second pipe should be a plus sign `+`: we are *adding* a layer to the `ggplot`; we are not piping. This is a common error we make. There is an `s` missing in `geom_beeswarm`. Finally, the comma that must separate the two arguments in the `aes` function is missing. Here is the corrected code (though it will not work if you have not already made the `prey_stats` object):

```
prey_stats %>%
  ggplot() +
  geom_beeswarm(mapping = aes(x = Sex, y = mean_wingspan))
```

**Q7.9** Calculate and plot a histogram of the probability of observing
each of the prey species across the whole study.

Answer: The probability is the number of times observed divided by the number of times a species could have been observed:

```
max_poss_obs <- bats %>%
  pull(Bat_ID) %>%
  unique() %>%
  length()

prob_obs <- bats %>%
  group_by(Species) %>%
  summarise(prob = n() / max_poss_obs)
```

And the histogram:

```
prob_obs %>%
  ggplot() +
  geom_histogram(aes(x = prob))
```

```
## `stat_bin()` using `bins = 30`. Pick better value
## with `binwidth`.
```

**Q7.10** Ensure that you have the odds ratio of appearing in a male or female poo for each of the prey species (the `odds_ratios` object in the book). Plot a histogram of all the calculated odds ratios. Guess how many prey species have odds between half and twice of appearing in a female compared to a male poo. Calculate how many, and see how close your guess is. (Hint: it may be worthwhile to plot log2 of the odds ratio.)

Answer:

```
odds_ratios %>%
  ggplot() +
  geom_histogram(mapping = aes(x = Odds_ratio))
```

```
## `stat_bin()` using `bins = 30`. Pick better value
## with `binwidth`.
```

```
## Warning: Removed 22 rows containing non-finite values
## (stat_bin).
```

```
odds_ratios %>%
  ggplot() +
  geom_histogram(mapping = aes(x = log2(Odds_ratio)))
```

```
## `stat_bin()` using `bins = 30`. Pick better value
## with `binwidth`.
```

```
## Warning: Removed 65 rows containing non-finite values
## (stat_bin).
```

```
odds_ratios %>%
  summarise(ans = sum(abs(log2(Odds_ratio)) < 1))
```

```
## # A tibble: 1 x 1
## ans
## <int>
## 1 25
```

**Q7.11** And now for a little bit of fun: Combine the script that
displays a photo of a cute dog with the script that gives some praise to
create a motivational poster of a dog giving praise.

Answer: Please let us know if you do this. We want to see your solutions, and will post them on Twitter (if you like)!

## 14.9 Chapter 8 (Making Deeper Insights: Part 1 - working with single variables)

### 14.9.1 General questions and exercises

**Q8.1** Give an example of a *continuous numeric variable* and explain
why it is so.

Answer: Examples of continuous variables include body mass, age, time,
and temperature. They can, in principle, take any positive real value.
You may also have given *change* in mass, which can take any (positive
or negative) real value.

**Q8.2** Give an example of a *discrete numeric variable* and explain
why it is so.

Answer: Examples of discrete variables include the number of individuals in a population, the number of offspring produced, and the number of individuals infected in an experiment. These can only be integers.

**Q8.3** Give an example of a *categorical variable*, and state if it is
*ordinal* or *nominal*.

Answer: Examples of *ordinal categorical variables* include academic
grades (i.e. A, B, C), size class of a plant (i.e. small, medium, large)
and measures of aggressive behaviour. Examples of nominal categorical
variables include sex, business type, eye colour, habitat, religion and
brand. There is no intrinsic order to the ‘levels’ of these categories.

**Q8.4** Numeric variables can also be said to be measured on an *interval* or a *ratio* scale. Explain the difference and give examples of each.

Answer: We can say that one tree is twice as tall as another, or that
one elephant has twice the mass of another, because length and mass are
always measured on *ratio* scales. I.e. the ratio of two values is
meaningful. Interval data still allow for the degree of difference
between values, just not the ratio between them. A good example of an
interval scale is date, which we measure relative to an arbitrary epoch
(e.g. AD). Hence, we cannot say that 2000 AD is twice as long as 1000
AD. However, we can compare the amount of time that has passed between
pairs of dates, i.e. their interval. For example, it’s perfectly
reasonable to say that twice as much time has passed since the epoch in
2000 AD versus 1000 AD.
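A quick base-R sketch of the same idea (the specific dates and the 1900 "epoch" here are just illustrative choices, not from the book):

```r
# Dates are on an interval scale: differences are meaningful,
# but ratios of the raw date values are not.
d1950 <- as.Date("1950-01-01")
d2000 <- as.Date("2000-01-01")
epoch <- as.Date("1900-01-01")  # an arbitrary reference point

# The interval between two dates is meaningful:
d2000 - d1950

# Ratios of raw dates are meaningless, but ratios of *elapsed time*
# (which is itself an interval, and so on a ratio scale) are fine:
as.numeric(d2000 - epoch) / as.numeric(d1950 - epoch)  # about 2
```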

**Q8.5** Explain in your own words what is meant by a “sample
distribution.”

Answer: The sample distribution is a statement about the frequency with which different values occur in a particular sample. If we were to repeat the same data collection protocol more than once we should expect to end up with a different sample each time. This results purely from chance variation and the fact that we can almost never sample everything we care about.

**Q8.6** What type of graph is a very good one for summarising a numeric variable?
Answer: A histogram / frequency distribution. It illustrates central tendency (mean, median, mode) and spread.

**Q8.7** When we make a histogram we must choose the number of bins. What is good practice when doing so?

Answer: To make a few histograms, each with different numbers of bins, and examine if this changes our conclusions about the distribution of the variable.

**Q8.8** If we have rather few data points (e.g. less than 30 or so),
what type of graph might we make instead of a traditional histogram?

Answer: A dot plot, in which each value is represented by a dot, and similar ones are stacked on top of each other.

**Q8.9** If we have very many data points (e.g. many thousands), what type of graph might we make instead of a traditional histogram?

Answer: A density plot, which shows the shape of the distribution as a continuous line. Here we must be careful to choose appropriately the “smoothness” of the line (and this choice is quite subjective, just as is the choice of number of bins).

**Q8.10** Compare and contrast two measures of central tendency.

Answer: The mean and the median are two measures of central tendency. The sample mean is sensitive to the shape of a distribution and the presence of outliers. The sample median is less so. The median of a sample is the value separating the upper half from the lower half of the distribution.
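A tiny made-up example makes the difference concrete: adding a single outlier drags the mean far to the right while barely moving the median.

```r
# A small sample, then the same sample with one outlier added
x     <- c(2, 3, 3, 4, 5)
x_out <- c(x, 100)

mean(x)       # 3.4
median(x)     # 3

mean(x_out)   # 19.5 -- dragged far to the right by the outlier
median(x_out) # 3.5  -- barely moved
```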

**Q8.11** Why do we advise use of the interquartile range to summarise
the dispersion of a sample distribution?

Answer: It is simple to understand: it is the range that contains “the middle 50%” of a sample, which is given by the difference between the third and first quartiles. And it is not as affected by outliers as other measures of dispersion. And it is what is commonly displayed in box-and-whisker plots.
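A quick base-R illustration with made-up numbers: the single outlier inflates the standard deviation but hardly affects the interquartile range.

```r
# A toy sample containing one outlier
x <- c(2, 3, 3, 4, 5, 100)

quantile(x, c(0.25, 0.75))  # the first and third quartiles
IQR(x)                      # third quartile minus first quartile
sd(x)                       # standard deviation, inflated by the outlier
```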

**Q8.12** Moving on now to summarising categorical variables, we are
more interested in how common in a sample are occurrences of each of the
categories. Why does it not make much sense to calculate a median, even
though R may let us do so?

Answer: There is no order to the categories, so we can't say which is larger than another. And even when there is an order (ordinal data), we can't say how much more different one category is from another.

### 14.9.2 Workflow demonstration questions and exercises

(Only one exercise in this section, since this chapter is mostly
conceptual. If you feel you need more practical experience, and have
worked through the bat diet workflow demonstration in the *Insights*
book, then consider working through one or more of the online workflow
demonstrations. Stop when you have explored the distribution, central
tendency, and dispersion/spread of some of the numeric and categorical
variables.)

**Q8.13** Try yourself, without using a function that does so directly, to calculate the median and interquartile range of the wingspan variable. Hint: it might be useful to sort/arrange the dataset by ascending values of the wingspan variable; and you will need to know the number of observations. Then use the `quantile` function to check your answer.

Answer:

```
bats %>%
  summarise(num_obs = sum(!is.na(Wingspan_mm)))
```

```
## # A tibble: 1 x 1
## num_obs
## <int>
## 1 555
```

```
# 555 observations
# 25% of 555 is 139
# 50% of 555 is 278 (for the median)
# 75% of 555 is 416
bats %>%
  filter(!is.na(Wingspan_mm)) %>%
  arrange(Wingspan_mm) %>%
  slice(c(139, 278, 416)) %>%
  pull(Wingspan_mm)
```

`## [1] 29 35 39`

And check using a function:

```
bats %>%
  summarise(quantile = scales::percent(c(0.25, 0.5, 0.75)),
            q_wing = quantile(Wingspan_mm, c(0.25, 0.5, 0.75), na.rm = TRUE))
```

```
## # A tibble: 3 x 2
## quantile q_wing
## <chr> <dbl>
## 1 25% 29
## 2 50% 35
## 3 75% 39
```

## 14.10 Chapter 9 (Making Deeper Insights Part 2: Relationships among (many) variables)

### 14.10.1 General questions and exercises

**Q9.1** When examining data for an association between two numeric
variables, what type of graph will likely be quite informative? And what
summary statistic might we use?

Answer: We will likely learn a lot from looking at a bivariate scatterplot of the observations, with one numeric variable mapped to the x-axis and the other mapped to the y-axis. We could then calculate the correlation coefficient of the association.

**Q9.2** What is a limitation of the Pearson’s correlation coefficient?

Answer: It should only be used when the association between the two variables is relatively linear.

**Q9.3** What two correlation coefficients can be used when the
association is relatively non-linear?

Answer: Spearman’s and Kendall’s correlation coefficients.

**Q9.4** What about those two correlation coefficients makes them
appropriate for non-linear associations?

Answer: They are based on the match (or mismatch) in the ranking of the observations, rather than their values. E.g. does the smallest value in one variable associate with the smallest value in the other variable?

**Q9.5** What does a correlation coefficient of 1 mean?

Answer: There is a perfect linear association between the values in the two variables (Pearson’s correlation coefficient), or perfect match in their rankings (Spearman’s and Kendall’s correlation coefficient).
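A quick sketch with made-up data: a perfectly monotonic but curved relationship gives a Pearson coefficient below 1, while the rank-based coefficients equal 1.

```r
x <- 1:10
y <- x^3  # curved, but the ranking of y matches the ranking of x exactly

cor(x, y, method = "pearson")   # less than 1: the association is not linear
cor(x, y, method = "spearman")  # 1: perfect match in the rankings
cor(x, y, method = "kendall")   # 1: every pair of observations is concordant
```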

**Q9.6** When making a scatterplot with many data points, what do we need to be careful of, in order not to mislead about the number of data points? And how could this be achieved?

Answer: We need to make sure that data points that could lie on top of each other are clearly shown as multiple data points. We could, for example, make data points in the figure larger if they represent more than one observation. The `geom_count` function can do this. With a lot of data, we may decide to make a 2-dimensional histogram (e.g. with the `geom_hex` function, which requires the **hexbin** package).

**Q9.7** What type of graph might we make in order to examine if there
is evidence of an association between two categorical variables?

Answer: A paired bar chart, showing the number of observations in each combination of the two categorical variables, would work. Here is the graph we can use for the `Migratory` variable and the `Family` variable in the bat diet dataset:

```
ggplot() +
  geom_bar(data = bats,
           mapping = aes(x = Family, fill = Migratory),
           position = "dodge") +
  coord_flip() +
  labs(x = "Family", y = "Number of Observations")
```

We could make the counts proportions and plot these:

```
ggplot(bats, aes(x = Family, fill = Migratory)) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent) +
  coord_flip()
```

We can then clearly see that the proportion of migratory differs among
the Families. I.e. the proportion migratory is *contingent* on the
Family.

**Q9.8** What kind of table (i.e. what is the name of the table) shows the number of cases for each combination, and therefore tells us if the number of counts is dependent on the value of both categorical variables?

Answer: A *contingency* table. I.e. is the number of migratory relative
to non-migratory prey *contingent* upon the Family?
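A contingency table is easy to make in base R with `table` (the data here are toy values for illustration, not the real bat diet counts):

```r
# Made-up observations of prey family and migratory status
diet <- data.frame(
  Family    = c("Noctuidae", "Noctuidae", "Crambidae", "Crambidae", "Crambidae"),
  Migratory = c("yes", "no", "no", "no", "yes")
)

# Counts for each combination of the two categorical variables:
# Crambidae: 2 no, 1 yes; Noctuidae: 1 no, 1 yes
table(diet$Family, diet$Migratory)
```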

**Q9.9** What types of graphs could we use to assess the evidence for an
association between a categorical and numeric variable?

Answer: A box-and-whisker plot is a good choice. Or just plotting all the data points if there are not so many. We could also make multiple histograms, one for each category of the categorical variable.

**Q9.10** What can we do if we would like to check for associations
among three or more variables?

Answer: Graphically we can do this for three, perhaps just about four, variables, by colouring points and using facets. But it gets tricky and complex. We can make all pairwise bivariate graphs, but this does not tell us about some complexities, such as two- or three-way dependencies among numeric variables. The general point is that with more variables it quickly gets much harder to check for the complex associations that can occur. If you're interested in learning some useful methods, take a look at the *Lurking variables* and *Ordination* sections on this web site.

### 14.10.2 Workflow questions and exercises

(Only one exercise in this section, since this chapter is mostly
conceptual. If you feel you need more practical experience, and have
worked through the bat diet workflow demonstration in the *Insights*
book, then consider working through one or more of the online workflow
demonstrations. Stop when you have explored the relationships among some
of the variables (which, in any case, will likely be the end of the
workflow demonstration).)

**Q9.11** Explore if there is any evidence of an association between the total number of reads in a poop and the number of prey observed in a poop. Hint: you will first need to calculate the number of prey, if you have not already. Make sure to take care that all data points are very likely visible.

Answer:

```
bats %>%
  group_by(Bat_ID) %>%
  summarise(num_prey = length(Species),
            num_reads = sum(No._Reads)) %>%
  ggplot() +
  geom_jitter(aes(x = num_reads, y = num_prey),
              width = 0, height = 0.2)
```

## 14.11 Chapter 10 (Looking back and looking forward)

You made it to the last chapter – congratulations. In the last chapter, as well as congratulating you, we mention four next steps: code style, visual communication, statistical analysis, and reproducibility. We go into some detail about the what, why, and how of reproducibility. Even as we finish this web page and the book is in production, there are new requirements to publish our code and data alongside the insights, and, perhaps even more importantly, to make the code (analysis script) reproducible. So training in how to make reproducible analyses will become more and more important, and be part of the Open Science movement. And so, only one question for this chapter, and it is about reproducibility…

**Q10.1** Which of these are good practices and bad practices for
reproducible methods for getting insights from data?

- Using absolute paths to data files.
- Never editing our raw data files “by-hand.”
- Putting all our files in one folder.
- Making our code easy for others to understand, including making many useful comments.
- Using RMarkdown to make reports.
- Future-proofing your code.

Answers:

- No, use relative paths.
- Yes, if you need to alter some datapoint, do so in a transparent and recorded method, such as in code.
- No, organising files of different types (e.g. data, code, figures) in different folders will usually help others understand what we’ve been up to.
- Yes, of course make it easy for others to understand!
- Yes, use RMarkdown, as this means we don’t need to copy and paste output/graphs into a word processor document, and in doing so potentially copy the wrong thing, or forget to update the inserted graph when we should.
- Yes, we should future-proof our code. Best to use as few add-on packages as possible, and then use ones that have been around a while and are still actively maintained. If you want to really protect against changes in R and packages, look into Docker, but beware that it's pretty complicated. It allows all software, code, and data to be “containerised.” We then share the container, and not just our code and data. Probably it, or something similar, will become very widespread, and hopefully much easier to use.

## 14.12 Polity, food diversity, and GDP challenge

Check to see if GDP or GDP per capita can explain variation in dietary diversity. What, if anything, would you imagine is the causation among political system, GDP, and dietary diversity (you might use the data to try to infer this, or just think about it)? You can get the GDP data from here: http://www.fao.org/faostat/en/#data/MK

Our solution (code only) is at the end of Answers chapter of this web site.

```
gdp <- read_csv("data/Macro-Statistics_Key_Indicators_E_All_Data.csv",
                locale = locale(encoding = 'ISO-8859-1'))

# First fix some variable names:
names(gdp) <- str_replace_all(names(gdp), c(" " = "_"))

# Remove all non-countries and
# keep only some of the elements
#gdp <- filter(gdp, Item_Code==22008 & Element_Code==6110)
gdp <- filter(gdp, Item_Code==22014 & Element_Code==6119)

gdp <- select(gdp, -Item_Code, -Area_Code,
              -Element_Code, -Element, -Unit, -Item,
              -ends_with("F"), -ends_with("N")) %>%
  pivot_longer(names_to = "Year", values_to = "GDP", cols = 2:49) %>%
  mutate(Year = as.numeric(substr(Year, 2, 5))) %>%
  rename(Country = Area)

fbs_pol_all <- full_join(fbs_pol, gdp)

fbs_pol_sum <- group_by(fbs_pol_all, Country) %>%
  summarise(mean_div = mean(Diversity, na.rm = TRUE),
            mean_pol = mean(Polity2, na.rm = TRUE),
            mean_GDP = mean(GDP, na.rm = TRUE))

ggplot() +
  geom_point(data = fbs_pol_sum,
             aes(x = log10(mean_GDP), y = mean_div, col = mean_pol), size = 3)

ggplot() +
  geom_point(data = fbs_pol_sum,
             aes(x = mean_pol, y = mean_div, col = log10(mean_GDP)), size = 3)

ggplot() +
  geom_point(data = fbs_pol_sum,
             aes(x = mean_pol, y = log10(mean_GDP), col = mean_div), size = 3)