Chapter 3 Questions and exercises

Below are questions and exercises related to each of the chapters of Insights book. In addition to the questions and exercises here, the fish dietary restriction Workflow Demonstration is presented here as a web page of questions, and with a web page of solutions here.

3.1 Preface

The following are multiple choice questions, in which each of the given options can be true or false.

Q0.1 Who is the intended audience/users/readers of Insights?

  • A People who have a need to efficiently, safely, and reliably derive robust insights from quantitative data.
  • B People with a good knowledge of working with data and data analysis, but not in R.
  • C Instructors/teachers of undergraduate’s first courses in data analysis, in particular in Life and Environmental Sciences.
  • D People who will benefit from a learning approach that gives confidence in their ability to succeed.

Q0.2 Insights teaches a newer (i.e. tidyverse) approach to using R, and not what might be called the “classic” or “traditional” approach. What does this mean?

  • A Learning based on the “tidyverse” packages that have revolutionised data exploration and analysis in R.
  • B A consistent and therefore easily taught and learned approach.
  • C A lot of use of square brackets and dollar signs in the R code.
  • D It allows us to very easily construct powerful, efficient, easy to troubleshoot, and just plain lovely pipelines that work on our data.
  • E Simple, powerful, flexible, and beautiful data visualisations.

Q0.3 Insights is a book about the process of getting insights from data, and yet it contains no statistical models (e.g. no regression, ANOVA, linear models, or generalised linear models. Why are such things not included?

  • A There is enough to be learned about data analysis and to be gained from data analysis without such tests. The skills covered in Insights are essential. Statistical tests should only be used to confirm what we see with our own eyes.
  • B Statistical tests can be quite daunting and difficult, so we leave them until we have a solid hold on more foundational skills in data analysis.
  • C Not including statistical tests eliminates the risk that early learning of statistical tests encourages a rather one dimensional view of data analysis focused on statistical significance and p-values.
  • D Avoiding statistics encourages us to pay more attention to the practical significance (as opposed to statistical significance.

Q0.4 What proportion of a data analysts’ time and effort is spent on tasks such as preparation, import, cleaning, tidying, checking, double-checking, manipulating, summarising and visualizating data?

Q0.5 From where do you get the datasets that you need to work along with the demonstrations in the book?

  • A E-mail one of us.
  • B From this website.
  • C Direclty from the authors of the original studies.
  • D From online data repositories where the authors of the original studies deposited their data for it to be shared and used.

Q0.6 Which one of the book’s authors works at the Natural History Museum in London?

Q0.7 Which one used to brew beer (from grain)?

Q0.8 Which one has has strangely hairy ears?

Q0.9 Which one has cuddled sheep on a remote Scottish Island?

3.2 Chapter 1 (Introduction)

Q1.1 Can we get insights from data without using complex statistical models and analyses, without machine learning, without being a master data scientist?

Q1.2 What is an advantage of focusing on the data without any statistical models/tests to think about?

Q1.3 With what should we start our journey from data to insights?

  • A dataset.
  • A graph showing the result we would like to find.
  • A clear and specific question.
  • Expectation about what we might find and why we might find it.

Q1.4 Why is it important to know if a study resulting in a dataset was a randomised manipulative experiment?

Q1.5 Datasets, i.e. convenient arrangements of data, come in many forms. Which arrangement is used throughout this book?

Q1.6 What is a “response variable” and what are other names for one?

Q1.7 What is an “explanatory variable” and what are other names for one?

Q1.8 Give five important features of datasets, and explain in your own words why they are important.

Q1.9 What do each of the four demonstration datasets used in the Workflow Demonstrations share in common?

**Q1.10* Which step in getting insights from data was missing from the presented workflow? (This is a bit of a trick question.)

3.3 Chapter 2 (Getting Acquainted)

Q2.1 True or false: R and RStudio are the same software program?

Q2.2 What will R tell us if we type into the Console 1 + 3 * 3?

Q2.3 What will R tell us if we type into the Console log(100)?

Q2.4 How would we assign to the name “my_constant” the value of “log(100)?”

Q2.5 What commands should we be typing directly into the Console?

Q2.6 What is the error in this code (hint: you do not need to know what the seq function is doing):

my_x_variable <- seq(0, 4*pi, length = 100)
my_y_variable <- sin(my_x_varible)

Q2.7 When we give a function some arguments, when should we name the arguments? E.g. when should we use round(x = 2.4326782647, digits = 2) rather than round(2.4326782647, 2)?

Q2.8 True or false: it is good practice to have at the top of our script a command to install each of the packages we would like to use. E.g. install.packages("dplyr").

Q2.9 True or false: When asking help from others it is sufficient to email them a verbal description of the problem, and a copy of the error message R.

Q2.10 If RStudio freezes, are we going to lose all our amazing work?

3.4 Chapter 3 (Workflow Demonstration–Part 1)

Q3.2 What are characteristic features of tidy data?

Q3.3 Write down two reasons why its a good idea to use the Projects that are available in RStudio.

Q3.4 What is the name given to this term %>% and what does it do?

Q3.5 A little trickier perhaps… which variable in the bat diet dataset is numeric but should, arguably, not be?

Q3.6 In what type of object do we often (and in this workflow) store data in R, and what are some features of this object.

Q3.7 We cleaned up the variable names using a string replacement functions (str_replace_all). Find the name of the add-on package that contains the rather nice function clean_names(), which can quickly and easily result in us having nice clean variable names.

Q3.8 Take a look at this version of the script of this part of the Bat Diet workflow demonstration. We have added a variety of errors into the script, and your task is to find the errors and correct them.

3.5 Chapter 4 (Workflow Demonstration–Part 2)

Q4.1 Chapter 3 mostly concerned getting the data ready. Chapter 4 is mostly about getting information from the data, including answering our questions. What two add-on package are used to do this, and what is each of them useful for?

Q4.2 What do we mean by “sanity-checking insights?”

Q4.3 How many times will you accidentally type in your code = when you meant to type ==, and what is the difference?

Q4.4 Imagine we calculate the mean of a variable and the answer, to our surprise, is NA. This is a surprise because we know the variable is numeric and we were therefore expecting a number. What is a likely reason for us getting NA, and what is one way of dealing with this?

Q4.5 Imagine that we observe that the association between two variables seems to depend on a third variable. E.g. the effects of herbivory on plant biomass is negative in the absence of a predator, but is positive in the presence of a predator. Which of these terms might be appropriate to describe this kind of pattern?:

  • An interaction.
  • Context dependence.
  • An association.
  • Independence.

Q4.6 We made a type of graph that allowed us to see the number of observations of a variable that had particular values (i.e. the number of values that fell in a particular bin)? What is a name of this type of graph, and why is it import to look at this?

Q4.7 We counted (calculated) the number of different prey species detected in the poop of each bat. What is the smallest value this number could take, and why is this important to realise?

Q4.8 In chapter 4 (and at many other places in Insights) we leave the axes labels of the displayed graphs with relatively ugly labels. E.g. in Figure 4.9 the y-axis label is num_prey. Why don’t we make these and other labels nicer?

4.9 Did you follow and understand the second part of this chapter “A prey-centric view” of the data and questions? (Quick answer: we (the authors) would be happy if you did decide to skip this section until you worked through the following chapters!)

Q4.10 The workflow demonstration presented in chapter 3 and 4 were quite linear… we did one thing, then built on it, then another building on that. What did we do in reality, while developing the workflow demonstration?

3.6 Chapter 5 (Dealing with data 1—Digging into dplyr)

3.6.1 General questions and exercises

Q5.1 What dplyr function (i.e. function from the dplyr add-on package) do we use to calculate summary information about data?

Q5.2 What dplyr function do we use to keep a collection of rows/observations in a dataset, according to the values in one of the variables?

Q5.3 What is %in% used for? (We ask because it is frequently very useful.)

Q5.4 What What dplyr function do we use to add a new variable to a dataset that is a transformation/modification of an existing variable?

Q5.5 List four other dplyr functions and write about what they’re used for.

3.6.2 Bat diet workflow questions and exercises

Q5.6 Find the identity (Bat_ID) of the two bats that ate only the largest (52.5mm wingspan) type of prey.

Q5.7 How many different prey species were detected in total?

Q5.8 The following code is intended to calculate the number of prey items found in each poop. Find and correct the three intentional errors it contains:

 prey_stats < bats %>%
  group_by(Bat_Id) %>%
  summarise(num_prey = n()

Q5.9 Calculate the number of times each prey item was observed.

Q5.10 Calculate number of migratory and non-migratory prey species, and pest or non-pest, and each combination of migratory and pest.

Q5.11 What was the maximum number of times a prey species could have been observed?

Q5.12 What proportion of prey species were observed in only one poo?

Q5.13 How many bats were caught on each of the dates?

Q5.14 The Abstract of the paper states that Lepidoptera were mostly from the Noctuidae and Geometridae familes. How many species of Noctuidae and Geometridae are there in the dataset?

Q5.15 The paper states that 56.9±36.7% were migratory moth species. Calculate this yourself.

Q5.16 Confirm the results from the paper: Moths (Lepidoptera; mainly Noctuidae and Geometridae) were by far the most frequently recorded prey, occurring in nearly all samples and accounting for 96 out of 115 prey taxa.

Q5.17 Confirm the results from the paper: Each pellet [poo] contained on average 4.1 ± 2.2 prey items

3.7 Chapter 6 (Dealing with data 2—Expanding your toolkit)

3.7.1 General questions and exercises

Q6.1 Describe in your own words a couple of reasons why piping (using %>%) is useful.

Q6.2 Manipulating strings is a very important skill. In your own words, describe what the following functions do: str_replace_all, case_when, separate, and grepl (all of which are mentioned in the book).

Q6.3 In your own words describe three other string manipulation functions.

Q6.4 We can use the function dmonths(1) to get the duration of one month, according to R. In which case we see that the duration is 2629800s (~4.35 weeks). What does this mean about how lubridate calculates this duration of a month?

Q6.5 Working with dates and times can be rather tricky and painful. What add-on package do we recommend for working with dates and times, because it contains many very useful and simple-to-use functions?

Q6.6 What would the function ymd be useful for?

Q6.7 Why is it useful to convert strings containing dates and times into date and time formatted variables?

Q6.8 In your own words write down what the pivot_longer function is used for, and also the three arguments (and their meanings) that we usually give it.

Q6.9 Look at the online Workflow Demonstration “Food-diversity Polity” and work through tidying (particularly the use of pivot_longer) the FAO Food Balance Sheet data.

3.7.2 Bat diet workflow questions and exercises

Q6.10 Make a pipeline of operations, just like the one in the Insights book, that will calculate the number of prey species in the poop of each bat, and the average number of prey per bat for each taxonomic Order.

Q6.11 Make a variable containing both the age and sex of the bats. E.g. “Adult-Female,” “Adult-Male,” and so on.

Q6.12 How long did the fieldwork take, from the first sample to the last?

Q6.13 Change the arrangement of the bat diet data to be wide-format. That is, for each bat make there be only one row, and have a column for each of the possible prey species, with number of reads at the entry.

Q6.14 If you have NA’s in the wide dataset you made for the answer to the previous question, figure out how to replace these with zeros. Hint: you do not need to use another function or operation.

Q6.15 How many rows will there be when we change this bats_wide dataset back to long format?

Q6.16 And now switch the bats_wide dataset back to long format, and the make a new column that is a binary prey presence/absence variable.

Q6.17 Confirm that the number of presences is the same as in the original dataset.

3.8 Chapter 7 (Getting to grips with ggplot2)

3.8.1 General questions and exercises

Q7.1 What is the difference between ggplot and ggplot?

Q7.2 In the context of a graph being made with ggplot, give two examples of aesthetic mappings, and give the function we use to specify them.

Q7.3 In the context of ggplot2, what is a “scale?”

Q7.4 In the context of ggplot2, what is a “layer?”

Q7.5 In the context of ggplot2, what is a “facet?”

3.8.2 Bat diet workflow questions and exercises

Q7.6 Plot at the shape of the distribution of number of poops collected on each sampling date.

Q7.7 Plot the distribution of number of reads. What shape is it? From looking at the distribution, predict whether the mean or median is larger. Then check your prediction by calculating the median and mean. For extra credit, add two vertical lines to the graph, one where the mean is, and one where the median is (hint, use the geom_vline function).

## `summarise()` has grouped output by 'Bat_ID', 'Sex'. You can override using the `.groups` argument.

Q7.8 The following code is intended to create a plot of the wingspan of prey found in male and female bat poops. Find and correct the three intentional errors it contains:

 prey_stats %>%
  ggplot() %>%
  geom_beewarm(mapping = aes(x = Sex y = mean_wingspan))

Q7.9 Calculate and plot a histogram of the probability of observing each of the prey species across the whole study.

Q7.10 Ensure that you have the odds ratio for appearing a male or female poo for each of the prey species (the odds_ratio object in the book). Plot a histogram of all the calculated odds ratios. Guess how many prey species with odds less than twice and less than half to appear in female compared to male poos. Calculate how many, and see how close is your guess. (Hint: it may be worthwhile to plot log2 of the odds ratio.)

Q7.11 And now for a little bit of fun: Combine the script that displays a photo of a cute dog with the script that gives some praise to create a motivational poster of a dog giving praise.

3.9 Chapter 8 (Making Deeper Insights: Part 1 - working with single variables)

3.9.1 General questions and exercises

Q8.1 Give an example of a continuous numeric variable and explain why it is so.

Q8.2 Give an example of a discrete numeric variable and explain why it is so.

Q8.3 Give an example of a categorical variable, and state if it is ordinal or nominal.

Q8.4 Numeric variables can also be said to be measure on an interval or a ratio scale. Explain the difference and give examples of each.

Q8.5 Explain in your own words what is meant by a “sample distribution.”

Q8.6 What type of graph is a very good one for getting a summarising a numeric variable?

Q8.7 When we make a histogram we much choose the number of bins. What is good practice when doing so?

Q8.8 If we have rather few data points (e.g. less than 30 or so), what type of graph might we make instead of a traditional histogram?

Q8.9 If we have very many of data points (e.g. many thousands) what type of graph might we make instead of a traditional histogram?

Q8.10 Compare and contrast two measures of central tendency.

Q8.11 Why do we advise use of the interquartile range to summarise the dispersion of a sample distribution?

Q8.12 Moving on now to summarising categorical variables, we are more interested in how common in a sample are occurrences of each of the categories. Why does it not make much sense to calculate a median, even though R may let us do so?

3.9.2 Workflow demonstration questions and exercises

(Only one exercise in this section, since this chapter is mostly conceptual. If you feel you need more practical experience, and have worked through the bat diet workflow demonstration in the Insights book, then consider working through one or more of the online workflow demonstrations. Stop when you have explored the distribution, central tendency, and dispersion/spread of some of the numeric and categorical variables.)

Q8.13 Try to yourself, without using a to do so directly, calculate the median and interquartile range of the wingspan variable. Hint: it might be useful to sort/arrange the dataset by ascending values of the wingspan variable; and you will need to know the number of observations. Then use the quantile function to check your answer.

3.10 Chapter 9 (Making Deeper Insights Part 2: Relationships among (many) variables)

Q9.1 When examining data for an association between two numeric variables, what type of graph will likely be quite informative? And what summary statistic might we use?

Q9.2 What is a limitation of the Pearson’s correlation coefficient?

Q9.3 What two correlation coefficients can be used when the association is relatively non-linear?

Q9.4 What about those two correlation coefficients makes them appropriate for non-linear associations?

Q9.5 What does a correlation coefficient of 1 mean?

Q9.6 When making a scatterplot with many data points, what do we need to be careful of, in order to not mislead about the amount of data points? And how could this be achieved?

Q9.7 What type of graph might we make in order to examine if there is evidence of an association between two categorical variables?

Q9.8 What kind of table (i.e. what name does the table have that) shows the number of cases for each combination, and therefore tells us if the number of counts is dependent on the value of both categorical variables?

Q9.9 What types of graphs could we use to assess the evidence for an association between a categorical and numeric variable?

Q9.10 What can we do if we would like to check for associations among three or more variables?

3.10.1 Workflow questions and exercises

(Only one exercise in this section, since this chapter is mostly conceptual. If you feel you need more practical experience, and have worked through the bat diet workflow demonstration in the Insights book, then consider working through one or more of the online workflow demonstrations. Stop when you have explored the relationships among some of the variables (which, in any case, will likely be the end of the workflow demonstration).)

Q9.11 Explore if there is any evidence of an associate between the total number of reads in a poop and the number of prey observed in a poop. Hint: you will first need to calculate the number of prey, if you have not already. Make sure to take care that all data points are very likely visible.

3.11 Chapter 10 (Looking back and looking forward)

You made it to the last chapter – congratulations. In the last chapter, as well a congratulating you, we mention four next steps: code style, visual communication, statistical analysis, and reproducibility. We go into some detail about the what, why, and how of reproducibility. Even as we finish this web page and the book is in production, there are new requirements to publish our code and data alongside the insights. And, perhaps even more importantly, that the code (analysis script) is reproducible. So training about how to make reproducible analyses will become more and more important, and be part of the Open Science movement. And so, only one question for this chapter, and it is about reproducibility…

Q10.1 Which of these are good practices and bad practices for reproducible methods for getting insights from data?

  • Using absolute paths to data files.
  • Never editing our raw data files “by-hand.”
  • Putting all our files in one folder.
  • Making our code easy for others to understand, including making many useful comments.
  • Use RMarkdown to make reports.
  • Future-proofing your code.