Companion Website — Insights from Data with R

Chapter 13 Related reading

Please let us know if you can recommend related reading that is not already mentioned below.

  • Grafen & Hails (2002) Modern Statistics for the Life Sciences. 368 pages. Focuses on and thoroughly covers statistics, using general linear models. Works with Minitab, SAS, SPSS.
  • Crawley (2005) Statistics - An Introduction Using R. 327 pages. A concise introduction focused on statistical analyses using R.
  • Crawley (2012) The R Book. 1076 pages. A comparatively encyclopedic account of R; “extensive and comprehensive.”
  • Hothorn & Everitt (2014) A Handbook of Statistical Analysis Using R. 456 pages. Focuses on statistical analyses; probably more graduate level.
  • Whitlock & Schluter (2015) The Analysis of Biological Data. 818 pages. Contains practice & assignment problems. Focused on statistics, covers data management/visualization in passing.
  • Maindonald & Braun (2010) Data Analysis and Graphics Using R. 549 pages. Assumes some existing knowledge of statistics and data analysis. For final year undergraduate / graduate level. Reaches to Bayesian methods, GLMMs, and random forests.
  • Hector (2015) The New Statistics with R. 199 pages. Focused on statistics, specifically linear models. “New” refers to new methods that are included, and focusing on effect sizes rather than p-values.
  • Field, Miles, & Field (2012) Discovering Statistics using R. 957 pages. Focused on statistics, though covers data management and visualization. Goes up to multilevel linear models. Classic R and R Commander (no RStudio). Written with humour, has “characters,” and an associated website with datasets, scripts, webcasts, self-assessment questions, additional material, answers, PowerPoint slides, links, and cyberworms of knowledge.
  • Field (2016) An Adventure in Statistics. 768 pages. At first (and perhaps later) sight quite inspirational. Starts with a chapter on why we need science (maybe to get insights?) followed by one on reporting findings. As such, it has a similar approach to Insights: starting with motivation and with the end in mind. Continues with a thorough account of data analysis and statistics suitable for undergraduates.
  • Bolker (2008) Ecological Models and Data in R. 396 pages. Page 3 states “I assume that you’ve had the equivalent of a one-semester undergraduate statistics course…” and on page 4 “If you have used R already, you’ll have a big head start.”
  • Venables, Smith, et al (2009) An Introduction to R. Reference book for the R Language (classic R). Very concise. Contains a 15-page chapter on statistics, including linear and non-linear models.
  • Grolemund & Wickham (2017) R for Data Science. 492 pages. Focus on “Data Science,” “an exciting discipline that allows you to turn raw data into understanding, insight, and knowledge.” Book organized broadly by the workflow: Explore, Wrangle, Program, Model, Communicate. Quite comprehensive in coverage of the “tidyverse” approach to using R.
  • McKillup (2012) Statistics Explained. An Introductory Guide for Life Scientists. 400 pages. Quite well rounded, including experimental design, collecting and displaying data, doing science, ethics. The majority walks through statistical tests… linear models, non-parametric tests, multivariate.
  • Dytham (2010) Choosing and Using Statistics: A Biologist’s Guide. 320 pages. Focused on statistics, as the title suggests.
  • Adler (2012) R in a Nutshell. A Desktop Quick Reference. 611 pages. A great reference book.
  • Dalgaard (2008) Introductory Statistics with R. 364 pages. A concise introduction focused on statistical analyses using R.
  • Spector (2008) Data Manipulation with R. 154 pages. Covers importing data, working with databases, character manipulation, dealing with dates, using loops, conversion to data frames.
  • Ellis (2010) The Essential Guide to Effect Sizes. 188 pages. Focuses on interpreting the practical everyday importance of research results, power, and synthesizing disparate results. Does this via effect sizes. Based on a course honed on “smart graduate students.”
  • Gotelli & Ellison (2012) A Primer of Ecological Statistics. 614 pages. Upper-undergraduate to graduate level. Probability and statistical thinking, distributions, central tendency and spread, p-values, etc. Then experimental design; then specific analyses. Finishes by covering estimates of diversity and occurrence.
  • Gonick & Smith (1993) The Cartoon Guide to Statistics. 230 pages. Covers summary and display of data, probability, central limit theorem, confidence interval estimation, etc.
  • McKillup (2011) Statistics Explained. An Introductory Guide for Life Scientists. 416 pages. Begins by explaining about doing science, collecting and displaying data, experimental design, and responsibility and ethics. Then works through a good list of statistical methods for beginning to upper-level undergraduates.
  • Sokal & Rohlf (1995) Biometry. The Principles and Practices of Statistics in Biological Research. 880 pages. Thorough, comprehensive, and often quite technical title focused on statistics.
  • Zar (2010) Biostatistical Analysis. 960 pages. Thorough and comprehensive coverage of “statistics analysis methods used by researchers to collect, summarise, analyse and draw conclusions from biological research.” Suitable for beginners to advanced users.
  • McElreath (2016) Statistical Rethinking. 469 pages. Brilliant. What should be taught to undergraduates, if only the world would then be ready for them.
  • Healy (2017) Data Visualisation for Social Science. A practical introduction with R and ggplot2. Focuses on appropriate visualization for getting knowledge from data. Covers principles and practices of looking at and presenting data.
  • Zumel & Mount (2019) Practical Data Science with R.

13.1 Data science related reading

  • This awesome web site (takes a few seconds to start up): Wrangling penguins: some basic data wrangling in R with dplyr

  • Data Science for Undergraduates: Opportunities and Options. (2018) National Academies Press, Washington, D.C.

  • https://teachdatascience.com/peerj/

13.2 Study design related reading

  • https://www.amazon.com/Asking-Questions-Biology-Experimental-Presentation/dp/1292085991

13.3 Web sites / pages

While writing Insights we came across and benefitted from looking at lots of lovely web sites and ideas. Here are a few that we particularly liked.

  • Fundamentals of Data Visualization by Claus O. Wilke. An online preview of the book “Fundamentals of Data Visualization” to be published with O’Reilly Media, Inc. (Maybe published by now.) Beautiful ggplot-focused compilation of visualisation guidelines and examples. R code available here: https://github.com/clauswilke/dataviz.

  • The Financial Times Visual Vocabulary website, made “to assist designers and journalists to select the optimal symbology for data visualisations.” Great for scientists too!

  • What they forgot to teach you about R, by Jennifer Bryan, Jim Hester. “We focus on building holistic and project-oriented workflows that address the most common sources of friction in data analysis, outside of doing the statistical analysis itself.”

  • In some situations we did not discuss in the book, such as with RMarkdown, one can experience a bit of trouble with RProjects and paths. This is solved by the lovely function here, by Kirill Müller. Here is an amusing page about here.

  • Animations of joining/merging two datasets.

  • Some advice on good practice (and bad) for naming files by Jennifer Bryan.

  • What promises to be a nice add-on to pipes, by Benjamin Elbers, giving brief reports in the Console about what happened at each step of the pipe: tidylog. It was not clear whether this would be stable/persistent at the time of writing Insights, so we include it here rather than in the text.

  • Simply Statistics. A nice website of articles about data analyses.

  • A tweet and replies about p-values including an excerpt from an article discussing The American Statistical Association statement about p-values: Practices that reduce data analysis or scientific inference to mechanical “bright-line” rules (such as “p < 0.05”) for justifying scientific claims or conclusions can lead to erroneous beliefs and poor decision making.

  • Tweets about how beautifying graphs can be very effective procrastination: procrastination by beautification. Somewhat humorous, but also with an element of truth.

  • Tweets about the weird things we sometimes find in datasets.

  • A nice way to get into regular expressions: RVerbalExpressions. “The goal of RVerbalExpressions is to make it easier to construct regular expressions using grammar and functionality inspired by VerbalExpressions. Usage of %>% is encouraged to build expressions in a chain like fashion.”

  • 7 Reasons for policy professionals to get into R programming in 2019

  • Excel is obsolete. Use R (and Python) instead.

  • We haven’t got or read this, but will let you know if we do! Principles of Strategic Data Science

  • Warning: a hideously intrusive website… Forbes, You Can Reduce Business Risk By Phasing Out Spreadsheets For Business.
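
The path trouble with RProjects and RMarkdown mentioned in the list above can be illustrated with a minimal sketch. It assumes the here package is installed and that the code is run from somewhere inside an RStudio Project; the folder name "data" and the file name "bat_diet.csv" are hypothetical and used only for illustration.

```r
# Sketch only: build a file path relative to the project root rather than
# relative to the current working directory (which can differ when an
# RMarkdown document is knitted from a sub-folder).
# "data" and "bat_diet.csv" are hypothetical names, not files from the book.
if (requireNamespace("here", quietly = TRUE)) {
  # here::here() joins its arguments onto the detected project root.
  data_path <- here::here("data", "bat_diet.csv")
} else {
  # Fallback if the here package is not installed: a plain relative path,
  # which does depend on the working directory.
  data_path <- file.path("data", "bat_diet.csv")
}

# Show the path that would be handed to a reader function such as read_csv().
print(data_path)
```

The resulting path could then be passed to whichever import function is in use, so that a script and an RMarkdown document in different folders of the same project point at the same file.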