Causal inference with competing events
Data cleaning and setup
Note: Most of the code in this tutorial is written in R base, to mimic the original code from the published papers (see References). Dr. Elena Dudukina has a version of this code in R tidyverse here.
Initial steps
- Load the required packages
library(Hmisc)
library(splines)
library(here)
- Download the data from the github repository, link here . The example dataset is Byar & Greene prostate cancer data. The data dictionary is available here.
Note: Please note that there are many slightly different versions of this dataset, and the one in this workshop is not identical the one available the hbiostat website.
<- read.csv(here::here("R", "prostate.csv"), sep = ';') prostate
- Load utility functions, these are located in the github repository. These functions are necessary to calculate hazards and cumulative incidence. Link
source(here::here("R", "utility_functions.R"))
Data cleaning
The goal is to create a data set in long format, so that every participant has as many rows as time-points of follow-up. New variables will be created as indicators of time, outcome, competing event and censoring event.
- Create a binary variable to indicate all-cause death
$allCause <- prostate$status != "alive" prostate
- Create a variable with 3 levels indicating the cause of death
$eventType <- as.factor(prostate$status)
prostate
levels(prostate$eventType) <-
list(alive = "alive",
pdeath = "dead - prostatic ca",
odeath = c(levels(prostate$eventType)[c(2:5, 7:10)]))
- Filter data to only include high dose DES (A=1) and placebo (A=0), clean additional covariates
$rx <- as.factor(prostate$rx)
prostate<- prostate[prostate$rx %in% levels(prostate$rx)[3:4], ]
prostRed $temprx = as.integer(prostRed$rx) - 3
prostRed$rx = abs(1 - prostRed$temprx) # DES is A=1 and placebo A=0
prostRed$eventType = as.integer(prostRed$eventType) - 1 #0 is censoring, 1 is pdeath, 2 is odeath
prostRed$hgBinary <- prostRed$hg < 12
prostRed$ageCat <- cut2(prostRed$age, c(0, 60, 70, 80, 100))
prostRed$normalAct <- prostRed$pf == "normal activity"
prostRed$eventCens <- prostRed$eventType == 0
prostRed$eventProst <- prostRed$eventType == 1 prostRed
- Descriptive information
table(prostRed$eventType, prostRed$rx)
##
## 0 1
## 0 32 32
## 1 37 27
## 2 58 66
Convert data to long format
$Tstart = -0.01 #Needed for the long format
prostRed
<- c(0:59)
cutTimes
<-
longProstRed survSplit(
data = prostRed,
cut = cutTimes,
start = "Tstart",
end = "dtime",
event = "allCause")
<-
longProstRedCens survSplit(
data = prostRed,
cut = cutTimes,
start = "Tstart",
end = "dtime",
event = "eventCens")
$eventCens <- longProstRedCens$eventCens
longProstRed
# Make column for prostate cancer mortality
$prostateDeath <-
longProstRed$allCause == 1 & longProstRed$eventType == 1
longProstRed
$otherDeath <-
longProstRed$allCause == 1 & longProstRed$eventType == 2 longProstRed
- Glimpse of data structure, for this matter we select 2 participants with different outcomes.
patno | dtime | eventCens | allCause | prostateDeath | otherDeath |
---|---|---|---|---|---|
462 | 0 | 0 | 0 | FALSE | FALSE |
462 | 1 | 0 | 0 | FALSE | FALSE |
462 | 2 | 0 | 0 | FALSE | FALSE |
462 | 3 | 0 | 0 | FALSE | FALSE |
462 | 4 | 0 | 0 | FALSE | FALSE |
462 | 5 | 0 | 0 | FALSE | FALSE |
462 | 6 | 0 | 0 | FALSE | FALSE |
462 | 7 | 0 | 0 | FALSE | FALSE |
462 | 8 | 0 | 0 | FALSE | FALSE |
462 | 9 | 0 | 0 | FALSE | FALSE |
462 | 10 | 0 | 0 | FALSE | FALSE |
462 | 11 | 0 | 0 | FALSE | FALSE |
462 | 12 | 0 | 0 | FALSE | FALSE |
462 | 13 | 0 | 1 | FALSE | TRUE |
480 | 0 | 0 | 0 | FALSE | FALSE |
480 | 1 | 0 | 0 | FALSE | FALSE |
480 | 2 | 0 | 0 | FALSE | FALSE |
480 | 3 | 0 | 0 | FALSE | FALSE |
480 | 4 | 0 | 0 | FALSE | FALSE |
480 | 5 | 0 | 0 | FALSE | FALSE |
480 | 6 | 0 | 0 | FALSE | FALSE |
480 | 7 | 0 | 1 | TRUE | FALSE |
- To be sure that right risk sets are used in models for this data structure without explicit conditioning, set “future values” of the competing event (D) and the outcome of interest (Y) to missing on a given line if C = 1 on that line and set future values of Y to missing on a line if D = 1 on that line (respects “order” C, D, Y in each interval). This may not be of an issue when there are no “ties” but with month long or bigger intervals intervals there might be.
$prostateDeath[longProstRed$eventCens == 1] <- NA
longProstRed
$otherDeath[longProstRed$eventCens == 1] <- NA
longProstRed
$prostateDeath[longProstRed$otherDeath == 1] <- NA longProstRed
patno | dtime | eventCens | allCause | prostateDeath | otherDeath |
---|---|---|---|---|---|
462 | 0 | 0 | 0 | FALSE | FALSE |
462 | 1 | 0 | 0 | FALSE | FALSE |
462 | 2 | 0 | 0 | FALSE | FALSE |
462 | 3 | 0 | 0 | FALSE | FALSE |
462 | 4 | 0 | 0 | FALSE | FALSE |
462 | 5 | 0 | 0 | FALSE | FALSE |
462 | 6 | 0 | 0 | FALSE | FALSE |
462 | 7 | 0 | 0 | FALSE | FALSE |
462 | 8 | 0 | 0 | FALSE | FALSE |
462 | 9 | 0 | 0 | FALSE | FALSE |
462 | 10 | 0 | 0 | FALSE | FALSE |
462 | 11 | 0 | 0 | FALSE | FALSE |
462 | 12 | 0 | 0 | FALSE | FALSE |
462 | 13 | 0 | 1 | NA | TRUE |
480 | 0 | 0 | 0 | FALSE | FALSE |
480 | 1 | 0 | 0 | FALSE | FALSE |
480 | 2 | 0 | 0 | FALSE | FALSE |
480 | 3 | 0 | 0 | FALSE | FALSE |
480 | 4 | 0 | 0 | FALSE | FALSE |
480 | 5 | 0 | 0 | FALSE | FALSE |
480 | 6 | 0 | 0 | FALSE | FALSE |
480 | 7 | 0 | 1 | TRUE | FALSE |
- Restrict input data set to records with dtime < K + 1 for fitting pooled over time models
# Before
dim(longProstRed)
## [1] 8722 29
<- longProstRed[longProstRed$dtime < length(cutTimes), ]
longProstRed
# After filtering records
dim(longProstRed)
## [1] 8670 29
- Create ‘baseline’ - data collected at visit 0
<- longProstRed[longProstRed$dtime == 0, ]
baseline
dim(baseline)
## [1] 252 29
- Save number of subjects
<- length(unique(longProstRed$patno))
n
n
## [1] 252
Next steps
References
Young JG, Stensrud MJ, Tchetgen Tchetgen EJ, Hernán MA. A causal framework for classical statistical estimands in failure time settings with competing events. Statistics in Medicine 2020. Paper link Code
Stensrud MJ, Young JG, Didelez V, Robins JM, Hernán MA. Separable Effects for Causal Inference in the Presence of Competing Events. Journal of the American Statistical Association 2022. Paper link
Rojas-Saunero LP, Young JG, Didelez V, Ikram MA, Swanson SA. Considering questions before methods in dementia research with competing events and causal goals. American Journal of Epidemiology. 2023. Paper link Code
Disclaimer
This workshop was based on the R code
from Young JG
et al. Statistics in Medicine 2020 and Stensrud
MJ et al. Journal of the American Statistical Association 2022.
Since both papers use the same data source and have similar procedures
of data cleaning, we have unified several preliminary steps, specially
in the Data cleaning and setup
section, for better
readability and comprehension in terms of the workshop. However, this
means that results will not be identical to the results of these
manuscripts.
Some of the differences that arise in this version, compared to the code by Stensrud et al. are:
DES is A=1 and placebo A=0
cutTimes
is 60, instead of 101prostRed$hgBinary <- prostRed$hg < 12
instead ofprostRed$hg < 10
Since we include the following code:
longProstRed$prostateDeath[longProstRed$eventCens == 1] <- NA
longProstRed$otherDeath[longProstRed$eventCens == 1] <- NA
longProstRed$prostateDeath[longProstRed$otherDeath == 1] <- NA
ipw_d <- cum_pred_O_1 / cum_pred_O_0
was used instead
of ipw_d[!t0] <- cum_pred_O_1[!t0] / cum_pred_O_0[!t0]
in the section of “Separable effects using IPW”.
Several variables names have been modified compared to the code by Young et al. and by Stensrud et al.
Hand calculations for the interaction terms between the exposure and time of follow-up were removed and indicators were included in the models.