Chapter 5 Lab 5

5.1 Pre-class activity

5.2 Wrangling Data

It would be nice to always get data formatted in the way that you want it, but alas, one of the challenges as a scientist is dealing with Other People’s Data. People often structure data in ways that is convenient for data entry, but not very convenient for data analysis, and so much effort must be expended ’wrangling’ data into shape before you can do more interesting things with it. Also, performing analyses often requires pulling together data obtained from different sources: you have done this in semester 1 by combining the participant information with the depression and happiness data. In this lesson, we are going to give you some tips on how to structure data, and introduce strategies for transforming and combining data from different sources.

5.3 The Autism Quotient (AQ)

To illustrate these concepts, we will use some actual survey data that was collected online using SurveyMonkey.com (Note: the data have been slightly cleaned up to make things a bit simpler; in reality, the data files were even messier than it appears!) For this research project, prospective participants completed the short 10-item version of the Autism-Spectrum Quotient (AQ) (Baron-Cohen, Wheelwright, Skinner, Martin, & Clubley, 2001), which is designed to measure autistic traits in adults. The data from Survey Monkey was downloaded as a .csv file, a common text format that you should be familiar with by now.

Table 1: The ten items on the AQ-10.

Q No. Question
Q1 I often notice small sounds when others do not.
Q2 I usually concentrate more on the whole picture, rather than small details.
Q3 I find it easy to do more than one thing at once.
Q4 If there is an interruption, I can switch back to what I was doing very quickly.
Q5 I find it easy to read between the lines when someone is talking to me.
Q6 I know how to tell if someone listening to me is getting bored.
Q7 When I’m reading a story, I find it difficult to work out the characters’ intentions.
Q8 I like to collect information about categories of things.
Q9 I find it easy to work out what someone is thinking or feeling just by looking at their face.
Q10 I find it difficult to work out people’s intentions.

Responses to each item were measured on a four-point scale: Definitely Disagree, Slightly Disagree, Slightly Agree, Definitely Agree. To avoid response bias, each question is scored according to one of two different question formats. For questions 1, 7, 8, and 10, a point is assigned for agreement (either “Slightly Agree” or “Definitely Agree”) and zero otherwise. We will call this format “F” (for forward). The remaining questions are reverse coded (format “R”): a point is assigned for disagreement (“Slightly Disagree” or “Definitely Disagree”) and zero otherwise.

The AQ for each participant is just the total points across all 10 questions. The higher the AQ score, the more ’autistic traits’ they are assumed to exhibit. Our goal is to calculate an AQ score for each participant in the dataset. Download the data here: AQ Data

The data should look like this: Data

To preserve anonymity, each participant was given a unique number, coded by the variable Id, shown in the first column of the table. The rest of the columns contain the responses associated with that participant for each of the 10 question (Q1, Q2, Q3, …, Q10).

This data format is known as wide format: each unit of analysis (participant) forms a single row in the table, and each observation on that unit forms a separate column. Wide format is often convenient for data entry, but for reasons that will soon become apparent, we will find this format inconvenient for the task of scoring the responses.

5.4 Thinking through the problem

How would you go about calculating AQ scores for each participant? Of course, you could do this by hand, but you have 10 responses for each of 66 participants. That makes 660 responses to code by hand. Not only is this a horribly mind-numbing task, it is also one in which you will be prone to make errors, especially as you get bored and your mind begins to wander. Even if you are 99% accurate, that means you will still get about 7 of the scores wrong. Worst of all, this approach does not scale beyond small datasets. As this survey was done over the internet, it could have easily been the case that you ended up with 6,600 participants instead of just 66.

Learn how to make the computer do the mind-numbing tasks; it won’t make mistakes, and will free up your mind to focus on the bigger issues in your research. These investments in building your data science skills will pay off handsomely in the long run as these are skills we as researchers use everyday.

OK, let’s imagine we are doing the task by hand so that we understand the logic. Once that logic is clear, we’ll go through it again and show you how to write the script to make it happen.

Let’s take stock of what we know. First, we know that there are two question formats, and that questions Q1, Q7, Q8, and Q10 are scored according to format F and questions Q2, Q3, Q4, Q5, Q6, and Q9 are scored according to format R. We can represent this information in the table below:

Question QFormat
Q1 F
Q2 R
Q3 R
Q4 R
Q5 R
Q6 R
Q7 F
Q8 F
Q9 R
Q10 F

We also know that for format F, we award a point for agree, zero for disagree; and for format R, a point for disagree, zero for agree. We can represent that information as:

QFormat Response Score
F Definitely Agree 1
F Slightly Agree 1
F Slightly Disagree 0
F Definitely Disagree 0
R Definitely Agree 0
R Slightly Agree 0
R Slightly Disagree 1
R Definitely Disagree 1

Now that we have put that information into tables, it just becomes a simple matter of looking up the format based on the question number (Q1, Q2, …, Q10); given the format and the response, we can then assign the score.

Let’s walk through the example with the first participant. For this participant (Id = 16), we have the following responses:

Question Response
Q1 Slightly Disagree
Q2 Definitely Agree
Q3 Slightly Disagree
Q4 Definitely Disagree
Q5 Slightly Agree
Q6 Slightly Agree
Q7 Slightly Agree
Q8 Definitely Disagree
Q9 Slightly Agree
Q10 Slightly Agree

Note that we have re-formatted the responses so that each response is in a separate row, rather than having all of the responses in a single row, as above. We have reshaped the data from its original wide format to long format. Obviously, this format is called ’long’ because instead of having just one row for each participant, we will now have 10 rows for a total of 66 x 10 = 660. While this format makes it less easy to take the whole dataset in with a single glance, it actually ends up being much easier to deal with, because ’Question’ is a now a single variable whose levels are Q1, Q2, …, Q10, and ’Response’ is also now a single variable. Most functions that you will be working with in R will expect your data to be in long rather than wide format.

Let’s now look up the format for each question:

Question Response QFormat
Q1 Slightly Disagree F
Q2 Definitely Agree R
Q3 Slightly Disagree R
Q4 Definitely Disagree R
Q5 Slightly Agree R
Q6 Slightly Agree R
Q7 Slightly Agree F
Q8 Definitely Disagree F
Q9 Slightly Agree R
Q10 Slightly Agree F

And now that we have the format and the response, we can look up the scores:

Question Response QFormat Score
Q1 Slightly Disagree F 0
Q2 Definitely Agree R 0
Q3 Slightly Disagree R 1
Q4 Definitely Disagree R 1
Q5 Slightly Agree R 0
Q6 Slightly Agree R 0
Q7 Slightly Agree F 1
Q8 Definitely Disagree F 0
Q9 Slightly Agree R 0
Q10 Slightly Agree F 1

Then we just add up the scores, which yields an AQ score of 4 for participant 16. We would then repeat this logic for the remaining 65 participants.Anyone fancy doing this for a big data set?! We will work on using R to get our data into shape in lab 1 and introduce pipes in lab 2 which are a more efficient way of joining functions together.

5.5 In class activity

5.6 Making the computer do the dirty work

Now let’s continue what we started in the prep by hand but now using R to change the data from wide to long format.

First, we will need to create a new working directory and download the data files.

  • responses.csv : survey responses
  • qformats.csv : table linking questions to question formats
  • scoring.csv : table linking formats and responses to score
  • pinfo.csv : participant information (Sex, Age, etc.)

Make a new R script, and make sure to set your working directory where those files can be located and load in tidyverse at the top of your R script then load in the three first files.

Whenever you load in data, it is always a good idea to have a quick look to make sure things make sense (using glimpse(), View() or just clicking on them in the environment). For example, we can view the responses table by just typing the name responses in the console. Do that now.

Now we can use R to transform our data from wide to long. We will use the function gather() function but we will do it for one participant just now. Some info on this gather() function can be found here What does gather() do?.

What this means is that we have created a tibble called rlong using the function gather. We want gather to look in a previously craeted tibble called responses and then pull out the variables ‘Question’ and ‘Response’ for questions 1 to 10. Q1:Q10 is just a convenient shorthand for Q1, Q2, Q3, Q4, Q5, Q6, Q7, Q8, Q9, Q10. We have now created a tibble with 660 observations over 3 variables; 10 observations per 66 participants on 3 variables. Let’s make it simpler just now and only use one participant. We can do that by using filter() which you used last semester and creating a new tibble called rlong_16.

We should now have a tibble with 10 observations of 3 variables.

OK, the next step: match each question with a format F or R. We have this information in the table qformats.

So how do we combine these two tables? We want to match up the tables on the column Question. We can do this using an inner_join(). Try it and see what happens:

Magic! The inner_join(), in effect added a new column to rlong which had the format of each question. Let’s look at the function call again….

The first two arguments to inner_join(), rlong and qformats, are the names of the tibbles that we want to join together. The next argument tells R how to join them, by Question. What the inner join does is match up rows in the two tables where both tables have the same value for the field named in the third argument, “Question”; it then combines the columns from the two tables, copying rows where necessary. To state it more simply, what it does, in effect, is the following: For each row in rlong, it checks the value of the column Question, and looks for rows with the same value in qformats, and then essentially combines all of the other columns in the two tables for these matching rows. If there are unmatching values, the rows get dropped. The inner_join() is one of the most useful and time-saving operations in data wrangling so keep ptracticing as it will keep reappearing time after time.

Now that we have matched up each question with its corresponding format, we can now “look up” the corresponding scores in the scoring table based on the format and the response. So what we need to do is to combine information in our new table, rlong2, and the scoring table. We do this, once again, with an inner join, and we will store the result in a new variable, rscores.

So here we have created a new tibble called rscores using inner_join to join rlong2 and scoring using columns QFormat and Response and you are at a point where you are ready to calculate the AQ scores for participant 16. You have used the function to calculate the total last semester. Can you think where? Look back in your scripts and .Rmd files to remember where you have used this and how you can use it to calculate the total for participant 16.