Chapter 3 Lab 3
In this lab we move on from reading data in, joining tibbles and selecting variables of interest to becoming familiar with the Wickham 6 and the functionality of tidyverse. Practice is an important part of PsyTeachR so in our level 1 program we ask students to repeat tasks across the semester in the pre-class, in-class and assessments. This principle is carried over into the student’s independent study practices. We encourage students to do prep independently but also to come along to the practice sessions where space is open for students to work on lab tasks and assessments independently or discuss in groups with GTAs present to guide and support if needed. This is an important part of our community building and creating a safe space to practice, make mistakes and develop skills. This lab also contains the first of a selection of short informal videos as support materials. For this lab, the course lead talks through the verbs demonstrating their functions with data in real time. As these videos were specific to our level 1 cohort we haven’t shared them here but should emphasise that they received very positive feedback from the students who liked being able to return to them as they practiced their tidyverse skills.
3.1 Pre-class activity
- Watch Heather’s video on intro to tidyverse
- Work through the tidyverse book which talks you through the purpose of tidyverse using examples on a dataset called
babynames
. This package is installed on the lab computers but you can alsoinstall.packages("babynames")
on your own personal computer if you want to prep for the lab at home.
Tidyverse (https://www.tidyverse.org/) is a collection of R packages created by world-famous data scientist Hadley Wickham.
Tidyverse contains six core packages: dplyr , tidyr , readr , purrr , ggplot2 , and tibble. If you’ve ever typed library(tidyverse) into R, you will have seen that it loads in all of these packages in one go. Within these six core packages, you should be able to find everything you need to analyse your data.
In this lab, we are going to focus on the dplyr package, which contains six important functions:
select()
Include or exclude certain variables (columns)filter()
Include or exclude certain observations (rows)mutate()
Create new variables (columns)arrange()
Change the order of observations (rows)group_by()
Organize the observations into groupssummarise()
Derive aggregate variables for groups of observations
These six functions are known as ’single table verbs’ because they only operate on one table at a time. Later in the course you will learn two-table verbs that you can use to merge tables together. Although the operations of these functions may seem very simplistic, it’s amazing what you can accomplish when you string them together: Hadley Wickham has claimed that 90% of data analysis can be reduced to the operations described by these six functions.
Remember these key points as you work through the data activities on this course
- RStudio is a tool that enables you to become confident and competent working with data
- This is essential as a psychologist as you need to know that the data findings are based on are reliable and valid. Imagine delivering a treatment that was based on research where the analyst was not confident or competent?
- Think of a skill you nailed the first time you did it. Not so easy? You will learn to work with data by repeating these basic skills over and over. It’s new and will take time. Give yourself credit for trying and don’t stress if it doesn’t work right away. Use the resources we give, that are being provided online among the open science community, and ask questions in class and on the forums.
3.2 In-class activities
OK, so what is our first step? Yup, thats right, we need to tell RStudio where to find all the files we need and pull them in ready for work. Let’s go back to lab 2 and use our previous script to remember how to load the package, read in the data, and join the two data files together:
3.2.1 Activity 1: Packages and data
Open up tidyverse and read in the data
library(tidyverse)
dat <- read_csv ('files/ahi-cesd.csv')
pinfo <- read_csv('files/participant-info.csv')
all_dat <- inner_join(dat, pinfo, by='id', 'intervention')
Now lets get busy using our tidyverse verbs…
3.2.2 Activity 2: Select
Select the columns all_dat, ahiTotal, cesdTotal, sex, age, educ, income, occasion, elapsed.days from the data and create a variable called summarydata
.
3.2.3 Activity 3: Arrange
Arrange the data in the variable created above (summarydata
) by ahiTotal with lowest score first.
3.2.4 Activity 4: Filter
Filter the data ahi_desc
by taking out those who are over 65 years of age.
3.2.5 Activity 5: Summarise
Then, use summarise to create a new variable data_median
, which calculates the median ahiTotal score in this grouped data and assign it a table head called median_score
.
3.2.6 Activity 6: Group_by
Group the data stored in the variable age_65max
by sex, and store it in data_sex
then use mutate to create a new column called Happiness which categorises participants based on whether they score above the median ahiTotal score for females.