Causal inference with competing events

Data cleaning and setup

Note: Most of the code in this tutorial is written in R base, to mimic the original code from the published papers (see References). Dr. Elena Dudukina has a version of this code in R tidyverse here.

Initial steps

  • Load the required packages
library(Hmisc)
library(splines)
library(here)
  • Download the data from the github repository, link here . The example dataset is Byar & Greene prostate cancer data. The data dictionary is available here.

Note: Please note that there are many slightly different versions of this dataset, and the one in this workshop is not identical the one available the hbiostat website.

prostate <- read.csv(here::here("R", "prostate.csv"), sep = ';')
  • Load utility functions, these are located in the github repository. These functions are necessary to calculate hazards and cumulative incidence. Link
source(here::here("R", "utility_functions.R"))

Data cleaning

The goal is to create a data set in long format, so that every participant has as many rows as time-points of follow-up. New variables will be created as indicators of time, outcome, competing event and censoring event.

  • Create a binary variable to indicate all-cause death
prostate$allCause <- prostate$status != "alive"
  • Create a variable with 3 levels indicating the cause of death
prostate$eventType <- as.factor(prostate$status)

levels(prostate$eventType) <-
  list(alive = "alive",
       pdeath = "dead - prostatic ca",
       odeath = c(levels(prostate$eventType)[c(2:5, 7:10)]))
  • Filter data to only include high dose DES (A=1) and placebo (A=0), clean additional covariates
prostate$rx <- as.factor(prostate$rx)
prostRed <- prostate[prostate$rx %in% levels(prostate$rx)[3:4], ]
prostRed$temprx = as.integer(prostRed$rx) - 3
prostRed$rx = abs(1 - prostRed$temprx) # DES is A=1 and placebo A=0
prostRed$eventType = as.integer(prostRed$eventType) - 1 #0 is censoring, 1 is pdeath, 2 is odeath
prostRed$hgBinary <- prostRed$hg < 12
prostRed$ageCat <- cut2(prostRed$age, c(0, 60, 70, 80, 100))
prostRed$normalAct <- prostRed$pf == "normal activity"
prostRed$eventCens <- prostRed$eventType == 0
prostRed$eventProst <- prostRed$eventType == 1
  • Descriptive information
table(prostRed$eventType, prostRed$rx)
##    
##      0  1
##   0 32 32
##   1 37 27
##   2 58 66

Convert data to long format

prostRed$Tstart = -0.01 #Needed for the long format

cutTimes <- c(0:59)

longProstRed <-
  survSplit(
    data = prostRed,
    cut = cutTimes,
    start = "Tstart",
    end = "dtime",
    event = "allCause")

longProstRedCens <-
  survSplit(
    data = prostRed,
    cut = cutTimes,
    start = "Tstart",
    end = "dtime",
    event = "eventCens")

longProstRed$eventCens <- longProstRedCens$eventCens

# Make column for prostate cancer mortality
longProstRed$prostateDeath <-
  longProstRed$allCause == 1 &  longProstRed$eventType == 1

longProstRed$otherDeath <-
  longProstRed$allCause == 1 & longProstRed$eventType == 2
  • Glimpse of data structure, for this matter we select 2 participants with different outcomes.
patno dtime eventCens allCause prostateDeath otherDeath
462 0 0 0 FALSE FALSE
462 1 0 0 FALSE FALSE
462 2 0 0 FALSE FALSE
462 3 0 0 FALSE FALSE
462 4 0 0 FALSE FALSE
462 5 0 0 FALSE FALSE
462 6 0 0 FALSE FALSE
462 7 0 0 FALSE FALSE
462 8 0 0 FALSE FALSE
462 9 0 0 FALSE FALSE
462 10 0 0 FALSE FALSE
462 11 0 0 FALSE FALSE
462 12 0 0 FALSE FALSE
462 13 0 1 FALSE TRUE
480 0 0 0 FALSE FALSE
480 1 0 0 FALSE FALSE
480 2 0 0 FALSE FALSE
480 3 0 0 FALSE FALSE
480 4 0 0 FALSE FALSE
480 5 0 0 FALSE FALSE
480 6 0 0 FALSE FALSE
480 7 0 1 TRUE FALSE
  • To be sure that right risk sets are used in models for this data structure without explicit conditioning, set “future values” of the competing event (D) and the outcome of interest (Y) to missing on a given line if C = 1 on that line and set future values of Y to missing on a line if D = 1 on that line (respects “order” C, D, Y in each interval). This may not be of an issue when there are no “ties” but with month long or bigger intervals intervals there might be.
longProstRed$prostateDeath[longProstRed$eventCens == 1] <- NA

longProstRed$otherDeath[longProstRed$eventCens == 1] <- NA

longProstRed$prostateDeath[longProstRed$otherDeath == 1] <- NA
patno dtime eventCens allCause prostateDeath otherDeath
462 0 0 0 FALSE FALSE
462 1 0 0 FALSE FALSE
462 2 0 0 FALSE FALSE
462 3 0 0 FALSE FALSE
462 4 0 0 FALSE FALSE
462 5 0 0 FALSE FALSE
462 6 0 0 FALSE FALSE
462 7 0 0 FALSE FALSE
462 8 0 0 FALSE FALSE
462 9 0 0 FALSE FALSE
462 10 0 0 FALSE FALSE
462 11 0 0 FALSE FALSE
462 12 0 0 FALSE FALSE
462 13 0 1 NA TRUE
480 0 0 0 FALSE FALSE
480 1 0 0 FALSE FALSE
480 2 0 0 FALSE FALSE
480 3 0 0 FALSE FALSE
480 4 0 0 FALSE FALSE
480 5 0 0 FALSE FALSE
480 6 0 0 FALSE FALSE
480 7 0 1 TRUE FALSE
  • Restrict input data set to records with dtime < K + 1 for fitting pooled over time models
# Before
dim(longProstRed)
## [1] 8722   29
longProstRed <- longProstRed[longProstRed$dtime < length(cutTimes), ]

# After filtering records 

dim(longProstRed)
## [1] 8670   29
  • Create ‘baseline’ - data collected at visit 0
baseline <- longProstRed[longProstRed$dtime == 0, ]

dim(baseline)
## [1] 252  29
  • Save number of subjects
n <- length(unique(longProstRed$patno))

n
## [1] 252

References

  • Young JG, Stensrud MJ, Tchetgen Tchetgen EJ, Hernán MA. A causal framework for classical statistical estimands in failure time settings with competing events. Statistics in Medicine 2020. Paper link Code

  • Stensrud MJ, Young JG, Didelez V, Robins JM, Hernán MA. Separable Effects for Causal Inference in the Presence of Competing Events. Journal of the American Statistical Association 2022. Paper link

  • Rojas-Saunero LP, Young JG, Didelez V, Ikram MA, Swanson SA. Considering questions before methods in dementia research with competing events and causal goals. American Journal of Epidemiology. 2023. Paper link Code

Disclaimer

This workshop was based on the R code from Young JG et al. Statistics in Medicine 2020 and Stensrud MJ et al. Journal of the American Statistical Association 2022. Since both papers use the same data source and have similar procedures of data cleaning, we have unified several preliminary steps, specially in the Data cleaning and setup section, for better readability and comprehension in terms of the workshop. However, this means that results will not be identical to the results of these manuscripts.

Some of the differences that arise in this version, compared to the code by Stensrud et al. are:

  • DES is A=1 and placebo A=0

  • cutTimes is 60, instead of 101

  • prostRed$hgBinary <- prostRed$hg < 12 instead of prostRed$hg < 10

  • Since we include the following code:

longProstRed$prostateDeath[longProstRed$eventCens == 1] <- NA

longProstRed$otherDeath[longProstRed$eventCens == 1] <- NA

longProstRed$prostateDeath[longProstRed$otherDeath == 1] <- NA

ipw_d <- cum_pred_O_1 / cum_pred_O_0 was used instead of ipw_d[!t0] <- cum_pred_O_1[!t0] / cum_pred_O_0[!t0] in the section of “Separable effects using IPW”.

  • Several variables names have been modified compared to the code by Young et al. and by Stensrud et al.

  • Hand calculations for the interaction terms between the exposure and time of follow-up were removed and indicators were included in the models.