Survival Analysis Part 2
Survival Analysis Part 2
In the first Survival Analysis topic note we assessed survival probability curves and if two different survival curves were significantly different. Instead of survival probability, here we will discuss hazard probability, or the probability that an individual has an event at some time \(t\). Survival probability on the other hand, refers to the opposite - that the individual will survive to time \(t\). To define the hazard probability we will use the following function:
Explanation of Terms
- \(t\) the survival time
- \(h(t)\) the hazard probability
- \(n\) different number of covariates \(x\) (so variables like age, sex, etc.)
- \(b\) are the coefficients of the impact of each of these covariates
- \(h_0(t)\) would be the baseline hazard
These \(b\) coefficients can also be referred to as hazard ratios. And they can be interpreted like so:
Hazard Ratio Interpretation
- Hazard Ratio < 1: there is a decrease in the hazard
- HHazard RatioR > 1: there is a increase in the hazard
- Hazard Ratio = 1: there is no effect on the hazard
Pre-Processing
We will start as we did in the first Survival Analysis topic note and load our libraries, data, and add a censor status column:
# load the libraries
.libPaths(c("/cluster/tufts/hpc/tools/R/4.0.0"))
library("survival")
library("survminer")
# load in patient meta data
meta <- read.csv(
file = "data/gbm_cptac_2021/data_clinical_patient.txt",
skip=4,
header = T,
sep = "\t"
)
# create a censored status column
# 1 being they are censored
# 2 being they are not censored
meta$status <- ifelse(
(meta$LOST_TO_FOLLOW_UP == "Yes" & meta$VITAL_STATUS == "Living"),
1,
2
)
Cox Proportional-Hazards Model
# run the Cox Proportional-Hazards Model on our data
# with a few covariates and get a summary
cox <- coxph(Surv(PATH_DIAG_TO_DEATH_DAYS, status) ~ SEX + AGE + BMI,
data = meta)
summary(cox)
Call:
coxph(formula = Surv(PATH_DIAG_TO_DEATH_DAYS, status) ~ SEX +
AGE + BMI, data = meta)
n= 62, number of events= 62
(37 observations deleted due to missingness)
coef exp(coef) se(coef) z Pr(>|z|)
SEXMale -0.300913 0.740143 0.266021 -1.131 0.258
AGE -0.003842 0.996166 0.011765 -0.327 0.744
BMI -0.006180 0.993839 0.025223 -0.245 0.806
exp(coef) exp(-coef) lower .95 upper .95
SEXMale 0.7401 1.351 0.4394 1.247
AGE 0.9962 1.004 0.9735 1.019
BMI 0.9938 1.006 0.9459 1.044
Concordance= 0.564 (se = 0.041 )
Likelihood ratio test= 1.52 on 3 df, p=0.7
Wald test = 1.55 on 3 df, p=0.7
Score (logrank) test = 1.56 on 3 df, p=0.7
What Does this mean?
- Here we note that all the coefficients for SEX(being male), AGE, and BMI have slight negative coefficients or hazard ratios
- We also note that no p-value is below 0.05 indicating that any observed effect on our hazard (death) could be due to chance
- We also note that sex has the most significant p-value and has the largest magnitude of all the coefficients
Assumptions
Before accepting the results of this model we should discuss the assumptions of the Cox Proportional-Hazards Model
Assumptions of the Cox Proportional-Hazards Model
- the hazard curves different groups should be proportional and not cross each other
- There are no outliers in our data
- The relationship between the log hazard and the covariates must be linear
Testing Proportionality in Hazards
# test proportional hazards assumption
cox_test <- cox.zph(cox)
ggcoxzph(cox_test)
What does this mean?
- Here we see that the global test is above 0.05, indicating we have not violated the proportional hazards assumption
Testing for Outliers
# test the outlier assumption
ggcoxdiagnostics(cox,
type = "dfbeta",
linear.predictions = FALSE,
ggtheme = theme_bw())
What does this mean?
- Here we see that there are a few outliers as the patterns of the residuals, especially for BMI, are not spread equally around 0.
Testing for Linearity
NEEDS MORE EXPLANATION HERE
# test for non-linearity for AGE
ggcoxfunctional(Surv(PATH_DIAG_TO_DEATH_DAYS, status) ~ AGE + log(AGE) + sqrt(AGE),
data = meta)