Causal inference with competing events

Data cleaning and setup

Note: Most of the code in this tutorial is written in R base, to mimic the original code from the published papers (see References). Dr. Elena Dudukina has a version of this code in R tidyverse here.

Initial steps

Load the required packages

library(Hmisc)
library(splines)
library(here)

Download the data from the github repository, link here . The example dataset is Byar & Greene prostate cancer data. The data dictionary is available here.

Note: Please note that there are many slightly different versions of this dataset, and the one in this workshop is not identical the one available the hbiostat website.

prostate <- read.csv(here::here("R", "prostate.csv"), sep = ';')

Load utility functions, these are located in the github repository. These functions are necessary to calculate hazards and cumulative incidence. Link

source(here::here("R", "utility_functions.R"))

Data cleaning

The goal is to create a data set in long format, so that every participant has as many rows as time-points of follow-up. New variables will be created as indicators of time, outcome, competing event and censoring event.

Create a binary variable to indicate all-cause death

prostate$allCause <- prostate$status != "alive"

Create a variable with 3 levels indicating the cause of death

prostate$eventType <- as.factor(prostate$status)

levels(prostate$eventType) <-
  list(alive = "alive",
       pdeath = "dead - prostatic ca",
       odeath = c(levels(prostate$eventType)[c(2:5, 7:10)]))

Filter data to only include high dose DES (A=1) and placebo (A=0), clean additional covariates

prostate$rx <- as.factor(prostate$rx)
prostRed <- prostate[prostate$rx %in% levels(prostate$rx)[3:4], ]
prostRed$temprx = as.integer(prostRed$rx) - 3
prostRed$rx = abs(1 - prostRed$temprx) # DES is A=1 and placebo A=0
prostRed$eventType = as.integer(prostRed$eventType) - 1 #0 is censoring, 1 is pdeath, 2 is odeath
prostRed$hgBinary <- prostRed$hg < 12
prostRed$ageCat <- cut2(prostRed$age, c(0, 60, 70, 80, 100))
prostRed$normalAct <- prostRed$pf == "normal activity"
prostRed$eventCens <- prostRed$eventType == 0
prostRed$eventProst <- prostRed$eventType == 1

Descriptive information

table(prostRed$eventType, prostRed$rx)

##    
##      0  1
##   0 32 32
##   1 37 27
##   2 58 66

Convert data to long format

prostRed$Tstart = -0.01 #Needed for the long format

cutTimes <- c(0:59)

longProstRed <-
  survSplit(
    data = prostRed,
    cut = cutTimes,
    start = "Tstart",
    end = "dtime",
    event = "allCause")

longProstRedCens <-
  survSplit(
    data = prostRed,
    cut = cutTimes,
    start = "Tstart",
    end = "dtime",
    event = "eventCens")

longProstRed$eventCens <- longProstRedCens$eventCens

# Make column for prostate cancer mortality
longProstRed$prostateDeath <-
  longProstRed$allCause == 1 &  longProstRed$eventType == 1

longProstRed$otherDeath <-
  longProstRed$allCause == 1 & longProstRed$eventType == 2

Glimpse of data structure, for this matter we select 2 participants with different outcomes.

patno	dtime	allCause	prostateDeath	otherDeath
462	0	0	FALSE	FALSE
462	1	0	FALSE	FALSE
462	2	0	FALSE	FALSE
462	3	0	FALSE	FALSE
462	4	0	FALSE	FALSE
462	5	0	FALSE	FALSE
462	6	0	FALSE	FALSE
462	7	0	FALSE	FALSE
462	8	0	FALSE	FALSE
462	9	0	FALSE	FALSE
462	10	0	FALSE	FALSE
462	11	0	FALSE	FALSE
462	12	0	FALSE	FALSE
462	13	1	FALSE	TRUE
480	0	0	FALSE	FALSE
480	1	0	FALSE	FALSE
480	2	0	FALSE	FALSE
480	3	0	FALSE	FALSE
480	4	0	FALSE	FALSE
480	5	0	FALSE	FALSE
480	6	0	FALSE	FALSE
480	7	1	TRUE	FALSE

To be sure that right risk sets are used in models for this data structure without explicit conditioning, set “future values” of the competing event (D) and the outcome of interest (Y) to missing on a given line if C = 1 on that line and set future values of Y to missing on a line if D = 1 on that line (respects “order” C, D, Y in each interval). This may not be of an issue when there are no “ties” but with month long or bigger intervals intervals there might be.

longProstRed$prostateDeath[longProstRed$eventCens == 1] <- NA

longProstRed$otherDeath[longProstRed$eventCens == 1] <- NA

longProstRed$prostateDeath[longProstRed$otherDeath == 1] <- NA

patno	dtime	allCause	prostateDeath	otherDeath
462	0	0	FALSE	FALSE
462	1	0	FALSE	FALSE
462	2	0	FALSE	FALSE
462	3	0	FALSE	FALSE
462	4	0	FALSE	FALSE
462	5	0	FALSE	FALSE
462	6	0	FALSE	FALSE
462	7	0	FALSE	FALSE
462	8	0	FALSE	FALSE
462	9	0	FALSE	FALSE
462	10	0	FALSE	FALSE
462	11	0	FALSE	FALSE
462	12	0	FALSE	FALSE
462	13	1	NA	TRUE
480	0	0	FALSE	FALSE
480	1	0	FALSE	FALSE
480	2	0	FALSE	FALSE
480	3	0	FALSE	FALSE
480	4	0	FALSE	FALSE
480	5	0	FALSE	FALSE
480	6	0	FALSE	FALSE
480	7	1	TRUE	FALSE

Restrict input data set to records with dtime < K + 1 for fitting pooled over time models

# Before
dim(longProstRed)

## [1] 8722   29

longProstRed <- longProstRed[longProstRed$dtime < length(cutTimes), ]

# After filtering records 

dim(longProstRed)

## [1] 8670   29

Create ‘baseline’ - data collected at visit 0

baseline <- longProstRed[longProstRed$dtime == 0, ]

dim(baseline)

## [1] 252  29

Save number of subjects

n <- length(unique(longProstRed$patno))

n

## [1] 252

References

Young JG, Stensrud MJ, Tchetgen Tchetgen EJ, Hernán MA. A causal framework for classical statistical estimands in failure time settings with competing events. Statistics in Medicine 2020. Paper link Code
Stensrud MJ, Young JG, Didelez V, Robins JM, Hernán MA. Separable Effects for Causal Inference in the Presence of Competing Events. Journal of the American Statistical Association 2022. Paper link
Rojas-Saunero LP, Young JG, Didelez V, Ikram MA, Swanson SA. Considering questions before methods in dementia research with competing events and causal goals. American Journal of Epidemiology. 2023. Paper link Code

Disclaimer

This workshop was based on the R code from Young JG et al. Statistics in Medicine 2020 and Stensrud MJ et al. Journal of the American Statistical Association 2022. Since both papers use the same data source and have similar procedures of data cleaning, we have unified several preliminary steps, specially in the Data cleaning and setup section, for better readability and comprehension in terms of the workshop. However, this means that results will not be identical to the results of these manuscripts.

Some of the differences that arise in this version, compared to the code by Stensrud et al. are:

DES is A=1 and placebo A=0
cutTimes is 60, instead of 101
prostRed$hgBinary <- prostRed$hg < 12 instead of prostRed$hg < 10
Since we include the following code:

longProstRed$prostateDeath[longProstRed$eventCens == 1] <- NA

longProstRed$otherDeath[longProstRed$eventCens == 1] <- NA

longProstRed$prostateDeath[longProstRed$otherDeath == 1] <- NA

ipw_d <- cum_pred_O_1 / cum_pred_O_0 was used instead of ipw_d[!t0] <- cum_pred_O_1[!t0] / cum_pred_O_0[!t0] in the section of “Separable effects using IPW”.

Several variables names have been modified compared to the code by Young et al. and by Stensrud et al.
Hand calculations for the interaction terms between the exposure and time of follow-up were removed and indicators were included in the models.

Causal inference with competing events