Kelly Katz
General background: Kelly is a medical student who decided to pursue a Master's in Clinical Research before returning to the hospital for her clinical rotations. She is very young and enthusiastic. Although she studied some statistics in high school, this is the first time she is learning how statistics are applied in the health sciences, and that motivates her.
Relevant experience: During her Master's program she learned how to do descriptive statistics and some basic analysis with SPSS, and she feels comfortable with it. She has used R and RStudio in some specific biostats courses but, as she puts it, it was “only copy-paste-enter pieces of code and interpret the output”. So far she has done only a bit of data wrangling in Excel, since most of her classes provided clean data ready for analysis.
Perceived needs: Kelly has one year to present her thesis, and she will work with observational data on individuals followed over time. She will have to merge and clean all the data files that she requires for her analysis, and she is a bit anxious about it. She heard about the reproducibility crisis in research during some seminars, and she feels that the best way to make her data cleaning process transparent is to do it in R.
Special considerations: Kelly gets very enthusiastic about learning, but she is used to learning from books and a clearly specified curriculum. The amount of information available online for learning R overwhelms her and keeps her from focusing, which frustrates her.
Needs: Kelly needs a clear structure and guide to learn R. A step-by-step tutorial on each topic from the R4DS book will help her learn the basics of data wrangling and plotting.
The class will introduce the rules of tidy data and the key functions to reshape data from wide to long and vice versa. The remaining concepts in the map will be taught in a following module of the extended class. The dplyr package has been studied and used so far, and the students are familiar with the pipe %>%.
Table A
country | 1999 | 2000 |
---|---|---|
Afghanistan | 745 | 2666 |
Brazil | 37737 | 80488 |
China | 212258 | 213766 |
Table B
country | year | rate |
---|---|---|
Afghanistan | 1999 | 745/19987071 |
Afghanistan | 2000 | 2666/20595360 |
Brazil | 1999 | 37737/172006362 |
Brazil | 2000 | 80488/174504898 |
China | 1999 | 212258/1272915272 |
China | 2000 | 213766/1280428583 |
Table C
country | year | cases |
---|---|---|
Afghanistan | 1999 | 745 |
Afghanistan | 2000 | 2666 |
Brazil | 1999 | 37737 |
Brazil | 2000 | 80488 |
China | 1999 | 212258 |
China | 2000 | 213766 |
The correct answer is Table C.
Misconceptions:
Datasets with repeated measurements over time are often laid out as in Table A. Students may have seen and used this type of dataset before and believe that each year represents a new variable, a new property to be measured. In fact, 1999 and 2000 are values of a single variable, year.
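The first misconception can be illustrated with a minimal sketch, assuming Table A is typed in by hand as a tibble. The object names table_a and table_c are chosen here for illustration only. Reshaping with pivot_longer moves the two year columns into a single year variable and gives the layout of Table C:

```r
library(dplyr)
library(tidyr)
library(tibble)

# Table A: one column per year (wide layout), entered by hand
table_a <- tribble(
  ~country,      ~`1999`, ~`2000`,
  "Afghanistan",     745,    2666,
  "Brazil",        37737,   80488,
  "China",        212258,  213766
)

# Collect the year columns into one `year` variable and
# their cell values into one `cases` variable
table_c <- table_a %>%
  pivot_longer(cols = c(`1999`, `2000`),
               names_to = "year",
               values_to = "cases") %>%
  mutate(year = as.integer(year))  # column names arrive as text

table_c
```

The column names 1999 and 2000 need backticks because they do not start with a letter, which is itself a hint that they are values rather than variable names.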
Although rate has values that reflect the state of the variable, it is not tidy because each cell contains two numeric values that represent two different variables: cases and population. With those variables in separate columns, it would be tidier to calculate the rate using mutate and obtain a single numeric value that can be summarised and plotted.
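One possible way to tidy Table B is sketched below, assuming the table is entered by hand and that the rate column is stored as text. The names table_b and tidy_rate are illustrative. The sketch splits the text column with tidyr's separate and then computes the rate with mutate:

```r
library(dplyr)
library(tidyr)
library(tibble)

# Table B: `rate` holds "cases/population" as a single text value
table_b <- tribble(
  ~country,      ~year, ~rate,
  "Afghanistan", 1999,  "745/19987071",
  "Afghanistan", 2000,  "2666/20595360",
  "Brazil",      1999,  "37737/172006362",
  "Brazil",      2000,  "80488/174504898",
  "China",       1999,  "212258/1272915272",
  "China",       2000,  "213766/1280428583"
)

tidy_rate <- table_b %>%
  # Split the text at "/" into two columns; convert = TRUE turns them into numbers
  separate(rate, into = c("cases", "population"), sep = "/", convert = TRUE) %>%
  # Compute the rate as a single numeric value per row
  mutate(rate = cases / population)

tidy_rate
```

After this step every column holds one variable, so rate can be passed directly to summarise or to a plot.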
We need to transform Table 1 so that it looks like Table 2. Fill in the blanks and correct the code if necessary:
survey %>%
  pivot_____(names_from = "____",
             values_from = "____")
Table 1
student | food | rate |
---|---|---|
1 | fruit | 5 |
1 | vegetable | 1 |
1 | icecream | 7 |
2 | fruit | 5 |
2 | vegetable | 4 |
2 | icecream | 3 |
3 | fruit | 1 |
3 | vegetable | 6 |
3 | icecream | 9 |
Table 2
student | fruit | vegetable | icecream |
---|---|---|---|
1 | 5 | 1 | 7 |
2 | 5 | 4 | 3 |
3 | 1 | 6 | 9 |
survey %>%
  pivot_wider(names_from = food,
              values_from = rate) %>%
  mytable()
student | fruit | vegetable | icecream |
---|---|---|---|
1 | 5 | 1 | 7 |
2 | 5 | 4 | 3 |
3 | 1 | 6 | 9 |
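For students who want to run the exercise themselves, here is a self-contained sketch. It assumes survey is built by hand from Table 1, and it leaves out mytable() because that helper's definition is not shown here. The names wide and long are illustrative. It also reverses the reshape with pivot_longer to show the wide-to-long direction:

```r
library(dplyr)
library(tidyr)
library(tibble)

# Table 1 entered by hand (long layout)
survey <- tribble(
  ~student, ~food,       ~rate,
  1,        "fruit",     5,
  1,        "vegetable", 1,
  1,        "icecream",  7,
  2,        "fruit",     5,
  2,        "vegetable", 4,
  2,        "icecream",  3,
  3,        "fruit",     1,
  3,        "vegetable", 6,
  3,        "icecream",  9
)

# Long to wide: one column per food, as in Table 2
wide <- survey %>%
  pivot_wider(names_from = food, values_from = rate)

# Wide to long: back to the layout of Table 1
long <- wide %>%
  pivot_longer(cols = -student, names_to = "food", values_to = "rate")

wide
long
```

Note that pivot_wider refers to existing columns, so food and rate can be given unquoted, while pivot_longer creates new columns, so their names are given as strings.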