Chapter 2 Insights Workflow

Here are the steps of our general workflow. It is quite common and uncontroversial. In Chapter 1 of the Insights book we describe the workflow and then demonstrate it in detail in Chapters 3 and 4.

Planning and preparation steps:

  • Decide upon a question.
  • Sketch a graph of a likely or possible answer. This can and should be the graph that we will use to answer the question and to communicate the answer.
  • Do some research about what one might expect the answer to be, and why.
  • Get feedback from experts, peers, and mentors, and revise the question appropriately.
  • Specify hypotheses.
  • Get feedback and revise.
  • Specify response variables, though this may have already been done in earlier steps.
  • Make predictions.
  • Design your study, including planning analyses.
  • Get feedback and revise.
  • Write a formal proposal in which all of the above are described. It is good practice to pre-register your proposal.

Performing the study and getting insights steps:

  • Perform your study, including data collection.
  • Back up the raw data.
  • Prepare the data (make it “research and re-use ready”).
  • Prepare computer, R, and RStudio.
  • Read the data into R, and refine the import as required.
  • Tidy the data.
  • Clean the data.
  • Create the initial insights.
  • Create insights about your question (i.e. answer the question).

Communicate your answer and its implications.

  • This step is of great importance and a whole field of its own. For example, how to give great oral presentations, and how to write great reports, are subjects about which many courses, videos, books, and blogs have been made.

Below we go into a little more detail about some of the second set of steps (Performing the study and getting insights steps). These are covered in even more detail in the book.

2.1 Preparing the data

If you do not already know the data files well, e.g. because you did not conduct the study, then before importing them into R, be sure to inspect them in a spreadsheet program (so long as they’re not too big) and note the following:

  • if multiple data files are used, which contains what.
  • what variable names are used in the data files, and what these mean (e.g. which are response variables, which are explanatory, and what are others).
  • the number of rows and columns in the data files.
  • the arrangement of the data in the data file, e.g. tidy or not tidy.
  • any obvious things to deal with (e.g. how missing values are coded, date/time information, codes that need expanding, variable/column names that will need changing).

2.2 Prepare your computer, R, and RStudio

We strongly advise you to use projects in RStudio. More details about this are in Chapter 3 of the Insights book.

  • Make a project folder and a data sub-folder.
  • Make other sub-folders, as you like.
  • Create the Project in RStudio.
  • Create a new script file.
  • Describe the project in the new script, including the date, authors, and project title.
  • Add to the script the code that loads any required add-on packages.
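
As a minimal sketch of the steps above, the top of a new script might look like the following. The folder name my_project is hypothetical, and the folders are made inside a temporary directory only so that the example leaves nothing behind. In practice you would make them wherever you keep your work and then create the RStudio project in that folder.

```r
# Project: <project title>
# Authors: <author names>
# Date:    <date the script was started>
# Purpose: <one or two sentences describing the project>

# Load any required add-on packages here, for example (assuming it is installed):
# library(readr)

# Hypothetical project folder, made in a temporary directory for illustration.
project_dir <- file.path(tempdir(), "my_project")

# Create the project folder and its data sub-folder in one step.
dir.create(file.path(project_dir, "data"), recursive = TRUE, showWarnings = FALSE)

# Confirm that the data sub-folder now exists.
dir.exists(file.path(project_dir, "data"))
```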

2.3 Read the data into R

Read in the data and then check at least the following:

  • Check number of variables/columns.
  • Check number of rows.
  • Check variable types.
  • Check appropriate representation of missing values.
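
The checks above can be sketched in a few lines of base R. The file contents, the column names, and the use of -999 as a missing value code are all invented for illustration, and the small file is written to a temporary location so that the example is self-contained.

```r
# Write a small, made-up CSV file to a temporary location.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "plot,treatment,height_cm",
  "1,control,12.5",
  "2,fertilised,NA",
  "3,control,-999"
), csv_path)

# Read the file, telling R which strings represent missing values.
# Here -999 is assumed to be a missing value code used by the data collector.
dat <- read.csv(csv_path, na.strings = c("NA", "-999"), stringsAsFactors = FALSE)

ncol(dat)            # number of variables/columns
nrow(dat)            # number of rows
sapply(dat, class)   # type of each variable
colSums(is.na(dat))  # number of missing values in each variable
```

If a variable that should be numeric is imported as character, that often points to a missing value code or a stray text entry that needs dealing with.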

Note that if you are attempting to import a comma-separated values (CSV) formatted data file, and strangely find that it is imported into R with only one column, it may be that you instead have a semi-colon (;) as the separator. This can happen if we open a CSV file in Excel and then allow Excel to save it (even though we changed nothing). One solution is to use read_csv2() to read the data file, as this uses a semi-colon as the separator.
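
The one-column symptom can be reproduced with a small made-up file. This sketch uses base R's read.csv() and read.csv2() so that it runs without add-on packages; read.csv2() is the base R counterpart of read_csv2(), and both expect a semi-colon separator and a comma as the decimal mark. The file contents are invented for illustration.

```r
# Write a small, made-up file that uses semi-colons as the separator.
semi_path <- tempfile(fileext = ".csv")
writeLines(c(
  "plot;height_cm",
  "1;12",
  "2;14"
), semi_path)

# Reading with a comma separator puts everything into a single column.
wrong <- read.csv(semi_path)
ncol(wrong)

# Reading with a semi-colon separator recovers both columns.
right <- read.csv2(semi_path)
ncol(right)
names(right)
```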

2.4 Tidy the data

  • Ensure there is one observation per row.
  • Ensure one type of information is not spread across multiple columns.
  • If there is more than one observation of the same type of information spread across columns, then gather observations into a single column (i.e. tidy the data).
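
As an illustration of gathering observations into a single column, here is a minimal sketch using base R's reshape(). The data, in which plant height was recorded in three weeks and stored in three columns, are invented for the example. Add-on packages offer alternatives, such as pivot_longer() in the tidyr package.

```r
# Made-up data that are not tidy: one type of information (height)
# is spread across three columns, one per week.
wide <- data.frame(
  plot   = c("A", "B"),
  week_1 = c(10, 12),
  week_2 = c(11, 15),
  week_3 = c(13, 18)
)

# Gather the three height columns into one, with a new column for week.
long <- reshape(
  wide,
  direction = "long",
  varying   = c("week_1", "week_2", "week_3"),  # columns to gather
  v.names   = "height",                         # name of the gathered column
  timevar   = "week",                           # name of the new week column
  times     = 1:3,                              # value of week for each gathered column
  idvar     = "plot"                            # identifies each original row
)
rownames(long) <- NULL

# Each row is now one observation: one plot in one week.
long
```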

2.5 Clean the data

  • Check for and resolve any inappropriate duplicates.
  • Ensure any date and time data are stored in R in a date/time format.
  • Replace any codes with informative words.
  • Check for appropriate/plausible variable entries, e.g. the set of values of character variables and the ranges of numeric variables.
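
A minimal base R sketch of these cleaning steps follows. The data, the treatment codes (C and N), and their meanings are all invented for illustration, and whether a duplicate is inappropriate always depends on the study.

```r
# Made-up raw data with a duplicated row, dates stored as text, and coded treatments.
raw <- data.frame(
  id     = c(1, 2, 2, 3),
  date   = c("2021-03-01", "2021-03-02", "2021-03-02", "2021-03-05"),
  trt    = c("C", "N", "N", "N"),
  mass_g = c(4.2, 5.1, 5.1, 4.8),
  stringsAsFactors = FALSE
)

# Check for fully duplicated rows, then remove them.
sum(duplicated(raw))
clean <- raw[!duplicated(raw), ]

# Store the dates in R's Date format rather than as text.
clean$date <- as.Date(clean$date, format = "%Y-%m-%d")

# Replace the codes with informative words, using a lookup vector.
trt_labels <- c(C = "control", N = "nitrogen added")
clean$trt <- unname(trt_labels[clean$trt])

# Check that the entries are plausible.
sort(unique(clean$trt))  # values of a character variable
range(clean$mass_g)      # range of a numeric variable
range(clean$date)        # range of the dates
```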

2.6 Initial insights

  • Check numbers of “things”: numbers of experimental units, treatments, treatment combinations, and temporal samples.
  • Calculate response and/or explanatory variable(s) (if required).
  • Examine the shapes of the distributions of variables (i.e. inspect the histograms of explanatory and response variables).
  • Examine relationships among explanatory variables. Are they correlated, and if so what effect might this have on our ability to answer our question(s)?
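
These checks can be sketched with the mtcars dataset that comes with R. Treating fuel efficiency (mpg) as the response variable, and weight (wt) and horsepower (hp) as explanatory variables, is an assumption made only for this example.

```r
dat <- mtcars

# Numbers of "things": rows (here, cars) and combinations of two grouping variables.
n_units <- nrow(dat)
counts <- table(cylinders = dat$cyl, transmission = dat$am)
n_units
counts

# Shapes of the distributions of the response and an explanatory variable.
hist(dat$mpg, main = "Response variable", xlab = "Miles per gallon")
hist(dat$wt, main = "Explanatory variable", xlab = "Weight (1000 lbs)")

# Relationship between two explanatory variables.
plot(hp ~ wt, data = dat, xlab = "Weight (1000 lbs)", ylab = "Horsepower")
r <- cor(dat$wt, dat$hp)
r
```

When two explanatory variables are strongly correlated, it becomes harder to separate their effects on the response variable.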

2.7 Insights about question posed

  • Reveal and examine relationships relevant to hypotheses/predictions.
  • Assess confidence in revealed patterns.
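
One possible way to carry out these two steps, sketched here with the mtcars dataset that comes with R, is to graph the relationship of interest and then fit a simple linear model and look at the confidence interval of its slope. The question (does fuel efficiency decline with car weight?) and the choice of a linear model are assumptions made for illustration, not a recommendation for any particular study.

```r
dat <- mtcars

# Reveal the relationship relevant to the (assumed) prediction.
plot(mpg ~ wt, data = dat, xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")

# Fit a simple linear regression and add the fitted line to the graph.
fit <- lm(mpg ~ wt, data = dat)
abline(fit)

# Assess confidence in the pattern: 95% confidence intervals for the coefficients.
ci <- confint(fit, level = 0.95)
ci
```

If the interval for the slope excludes zero, the data are consistent with a relationship in that direction, subject to the assumptions of the model.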