Learner persona

Kelly Katz

  • General background: Kelly is a medical student who decided to pursue a Master's in Clinical Research before returning to the hospital for her clinical rotations. She is very young and enthusiastic. Although she studied some statistics in high school, this is the first time she is learning how statistics is applied in the health sciences, and that motivates her.

  • Relevant experience: During her Master's program she learned how to compute descriptive statistics and run some basic analyses in SPSS, and she feels comfortable with it. She has used R and RStudio in some biostatistics courses but, as she puts it, it was “only copy-paste-enter pieces of code and interpret the output”. So far she has done only a little data wrangling in Excel, because most of her classes provided clean data ready for analysis.

  • Perceived needs: Kelly has one year to present her thesis, and she will work with observational data on individuals followed over time. She will have to merge and clean all the data files required for her analysis, and she is a bit anxious about it. She has heard about the reproducibility crisis in research during some seminars, and she feels that the best way to make her data cleaning process transparent is to do it in R.

  • Special considerations: Kelly gets very enthusiastic about learning, but she is used to learning from books and a clearly specified curriculum. The amount of information available online for learning R overwhelms her and keeps her from focusing, which frustrates her.

  • Needs: Kelly needs a clear structure and guide to learn R. A step-by-step tutorial on each topic from the R4DS book will help her learn the basics of data wrangling and plotting.

Concept map

The class will introduce the rules of tidy data and the key functions to reshape data from wide to long and vice versa. The remaining concepts in the map will be taught in a following module of the extended class. The students have already studied and used the dplyr package, and they are familiar with the pipe %>%.

Formative assessments

1. Which of these tables meets the 3 rules of tidy data?

Exercise

Table A

country      1999    2000
Afghanistan  745     2666
Brazil       37737   80488
China        212258  213766

Table B

country      year  rate
Afghanistan  1999  745/19987071
Afghanistan  2000  2666/20595360
Brazil       1999  37737/172006362
Brazil       2000  80488/174504898
China        1999  212258/1272915272
China        2000  213766/1280428583

Table C

country      year  cases
Afghanistan  1999  745
Afghanistan  2000  2666
Brazil       1999  37737
Brazil       2000  80488
China        1999  212258
China        2000  213766

Solution

The correct answer is Table C.

Misconceptions:

  • Datasets with repeated measurements over time are often laid out as in Table A. Students may have seen and used this kind of dataset before and believe that each year is a new variable, a new property to be measured. In fact, 1999 and 2000 are values of a single variable, year.

  • Although the rate column in Table B has one entry per country and year, it is not tidy because each entry contains two numeric values that represent two different variables: cases and population. It is tidier to keep those two variables in separate columns and calculate the rate using mutate, which gives a numeric value that can be summarised and plotted.
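
Both misconceptions can be made concrete with a short sketch. It assumes the tidyr and dplyr packages are installed. The object names table_a, table_b, table_c and table_b_tidy are made up for illustration, and the values are copied from the exercise tables. This is one possible way to tidy the two tables, not necessarily the approach used in class.

```r
library(dplyr)
library(tidyr)

# Table A: the years are spread across the column names
table_a <- tribble(
  ~country,      ~`1999`, ~`2000`,
  "Afghanistan",     745,    2666,
  "Brazil",        37737,   80488,
  "China",        212258,  213766
)

# Gather the year columns into a single `year` variable (same layout as Table C)
table_c <- table_a %>%
  pivot_longer(cols = c(`1999`, `2000`),
               names_to = "year",
               values_to = "cases") %>%
  mutate(year = as.integer(year))

# Table B: `rate` stores cases and population together as text
table_b <- tribble(
  ~country,      ~year, ~rate,
  "Afghanistan",  1999, "745/19987071",
  "Afghanistan",  2000, "2666/20595360",
  "Brazil",       1999, "37737/172006362",
  "Brazil",       2000, "80488/174504898",
  "China",        1999, "212258/1272915272",
  "China",        2000, "213766/1280428583"
)

# Split `rate` into two numeric columns, then compute the rate with mutate()
table_b_tidy <- table_b %>%
  separate(rate, into = c("cases", "population"), sep = "/", convert = TRUE) %>%
  mutate(rate = cases / population)

table_c
table_b_tidy
```

After these steps every column holds one variable and every row holds one country-year observation, so the rate can be passed directly to summarise or to a plot.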

2. Transform Table 1 to look like Table 2.

Exercise

We need to transform Table 1 to look like Table 2. Fill in the blanks and correct the code if necessary:

survey %>% 
  pivot_____(names_from = "____",
              values_from = "____")

Table 1

student  food       rate
1        fruit      5
1        vegetable  1
1        icecream   7
2        fruit      5
2        vegetable  4
2        icecream   3
3        fruit      1
3        vegetable  6
3        icecream   9

Table 2

student  fruit  vegetable  icecream
1        5      1          7
2        5      4          3
3        1      6          9

Solution

survey %>%
  pivot_wider(names_from = food,
              values_from = rate) %>% 
  mytable()

student  fruit  vegetable  icecream
1        5      1          7
2        5      4          3
3        1      6          9
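
For anyone who wants to run the solution, here is a self-contained sketch. It assumes the tidyr and dplyr packages are installed. The survey tibble is rebuilt by hand from Table 1, since the original data object is not shown, and the mytable() call is left out because its definition does not appear in this material; the result is printed directly instead. The final step, added here only for illustration, reverses the reshape with pivot_longer; survey_wide and survey_long are hypothetical names.

```r
library(dplyr)
library(tidyr)

# Rebuild Table 1 (long format: one row per student and food)
survey <- tribble(
  ~student, ~food,       ~rate,
  1,        "fruit",     5,
  1,        "vegetable", 1,
  1,        "icecream",  7,
  2,        "fruit",     5,
  2,        "vegetable", 4,
  2,        "icecream",  3,
  3,        "fruit",     1,
  3,        "vegetable", 6,
  3,        "icecream",  9
)

# Long to wide: one column per food, filled with the ratings (Table 2)
survey_wide <- survey %>%
  pivot_wider(names_from = food,
              values_from = rate)

survey_wide

# Wide to long: gather the food columns back into food/rate pairs (Table 1)
survey_long <- survey_wide %>%
  pivot_longer(cols = -student,
               names_to = "food",
               values_to = "rate")

survey_long
```

Note that pivot_wider takes existing column names, so they can be written unquoted, whereas names_to and values_to in pivot_longer create new columns and are given as strings.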