Survival Forest models
Ensemble models that use decision trees as their base learners can be extended to handle censored datasets. These models are grouped under the name Survival Forest models. PySurvival contains 3 types of Survival Forest models:
- Random Survival Forest model
- Extremely Randomized (Extra) Survival Trees model
- Conditional Survival Forest model
These models have been adapted to Python from the ranger package, a fast implementation of random forests in C++.
Ishwaran et al. provide a general framework that describes the underlying algorithm powering the Survival Forest models:
1. Draw B random samples of the same size from the original dataset with replacement. The observations that are not drawn in a given sample are said to be out-of-bag (OOB).
2. Grow a survival tree on each of the B samples.
    a. At each node, select a random subset of predictor variables and find the predictor and splitting value that yield the two subsets (the daughter nodes) maximizing the difference in the objective function.
    b. Repeat a. recursively on each daughter node until a stopping criterion is met.
3. Calculate a cumulative hazard function (CHF) for each tree and average the CHFs of the B trees to obtain the ensemble CHF.
4. Compute the prediction error for the ensemble CHF using only the OOB data.
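
As a rough illustration of the bootstrap and averaging steps of this framework, the sketch below draws bootstrap samples, records which observations are out-of-bag, and averages one cumulative hazard function per sample. It is not PySurvival's implementation: all function names are hypothetical, and each "tree" is replaced by an unsplit root node whose CHF is a Nelson-Aalen estimate, an assumption made only to keep the example short. The OOB prediction error is not computed here.

```python
import random


def bootstrap_indices(n, rng):
    """Draw n indices with replacement; return (in_bag, out_of_bag)."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    drawn = set(in_bag)
    oob = [i for i in range(n) if i not in drawn]
    return in_bag, oob


def nelson_aalen(times, events, grid):
    """Nelson-Aalen cumulative hazard evaluated at each time in `grid`."""
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    chf = []
    for g in grid:
        total = 0.0
        for s in event_times:
            if s > g:
                break
            d = sum(1 for t, e in zip(times, events) if t == s and e == 1)
            at_risk = sum(1 for t in times if t >= s)
            total += d / at_risk
        chf.append(total)
    return chf


def ensemble_chf(times, events, grid, num_trees=50, seed=0):
    """Average the per-sample CHFs over `num_trees` bootstrap samples."""
    rng = random.Random(seed)
    n = len(times)
    sums = [0.0] * len(grid)
    for _ in range(num_trees):
        in_bag, _oob = bootstrap_indices(n, rng)
        # Assumption: a single unsplit node stands in for a grown survival tree.
        tree_chf = nelson_aalen(
            [times[i] for i in in_bag], [events[i] for i in in_bag], grid
        )
        sums = [s + c for s, c in zip(sums, tree_chf)]
    return [s / num_trees for s in sums]
```

In a real forest, each tree would route an observation to a terminal node and return that node's CHF; the averaging over trees stays the same.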
All the Survival Forest models in PySurvival use this framework as the basis of their fitting algorithm. The objective function is the main element that differentiates them from one another.
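For the Random Survival Forest model, at each node we choose a predictor $x$ from a subset of randomly selected predictor variables and a split value $c$, where $c$ is one of the unique values of $x$.
We assign each individual sample to either the right daughter node, if $x \leq c$, or the left daughter node, if $x > c$. Then we calculate the value of the log-rank test statistic over the $N$ distinct event times $t_1 < \dots < t_N$ such that:
$$L(x, c) = \frac{\sum_{i=1}^{N}\left(d_{i,1} - Y_{i,1}\frac{d_i}{Y_i}\right)}{\sqrt{\sum_{i=1}^{N}\frac{Y_{i,1}}{Y_i}\left(1 - \frac{Y_{i,1}}{Y_i}\right)\left(\frac{Y_i - d_i}{Y_i - 1}\right)d_i}}$$
- $j$: Daughter node, $j \in \{1, 2\}$.
- $d_{i,j}$: Number of events at time $t_i$ in daughter node $j$.
- $Y_{i,j}$: Number of units that experienced an event or are at risk at time $t_i$ in daughter node $j$.
- $d_i$: Number of events at time $t_i$, so $d_i = \sum_j d_{i,j}$.
- $Y_i$: Number of units that experienced an event or are at risk at time $t_i$, so $Y_i = \sum_j Y_{i,j}$.
We loop through every $x$ and $c$ until we find $x^*$ and $c^*$ that satisfy $|L(x^*, c^*)| \geq |L(x, c)|$ for every $x$ and $c$.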
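
To make the log-rank splitting rule concrete, here is a minimal pure-Python sketch that evaluates the standard two-sample log-rank statistic for one candidate partition and then scans the unique values of a single predictor. The function names `logrank_statistic` and `best_split` are hypothetical, and the code favors readability over speed; it is not how PySurvival or ranger compute the split.

```python
import math


def logrank_statistic(times, events, in_node_1):
    """Log-rank statistic for a partition; in_node_1[k] is True when
    sample k falls in daughter node 1."""
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    num, var = 0.0, 0.0
    for t in event_times:
        # Y, Y1: units at risk at time t (overall, and in daughter node 1)
        Y = sum(1 for s in times if s >= t)
        Y1 = sum(1 for s, g in zip(times, in_node_1) if s >= t and g)
        # d, d1: events at time t (overall, and in daughter node 1)
        d = sum(1 for s, e in zip(times, events) if s == t and e == 1)
        d1 = sum(
            1 for s, e, g in zip(times, events, in_node_1)
            if s == t and e == 1 and g
        )
        num += d1 - Y1 * d / Y
        if Y > 1:  # the variance term is undefined when one unit is at risk
            var += (Y1 / Y) * (1 - Y1 / Y) * ((Y - d) / (Y - 1)) * d
    return num / math.sqrt(var) if var > 0 else 0.0


def best_split(x, times, events):
    """Scan the unique values of predictor x and keep the split value
    with the largest absolute log-rank statistic."""
    best_c, best_stat = None, 0.0
    # The largest unique value is skipped: it would leave one node empty.
    for c in sorted(set(x))[:-1]:
        stat = logrank_statistic(times, events, [xi <= c for xi in x])
        if abs(stat) > abs(best_stat):
            best_c, best_stat = c, stat
    return best_c, best_stat
```

Swapping the two daughter nodes only flips the sign of the statistic, which is why the search compares absolute values.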
Extra Survival Trees models use the same objective function as the Random Survival Forest models. However, for each predictor $x$, instead of using the unique values of $x$ to find the best split value $c^*$, we use values drawn from a uniform distribution over the interval $[\min(x), \max(x)]$.
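
The difference in how candidate split values are generated can be sketched as follows. The helper names, the `num_splits` argument, and the `score` callable (any function mapping a split value to an objective value, such as a log-rank statistic) are assumptions made for this illustration and are not taken from PySurvival's API.

```python
import random


def random_split_candidates(x, num_splits, rng):
    """Draw `num_splits` candidate split values uniformly from [min(x), max(x)]."""
    lo, hi = min(x), max(x)
    return [rng.uniform(lo, hi) for _ in range(num_splits)]


def best_random_split(x, score, num_splits=10, seed=0):
    """Return the random candidate with the largest absolute objective value.

    `score` is a hypothetical callable: split value -> objective value.
    """
    rng = random.Random(seed)
    candidates = random_split_candidates(x, num_splits, rng)
    return max(candidates, key=lambda c: abs(score(c)))
```

Because only a handful of random candidates are scored per predictor, each node is cheaper to split than with an exhaustive scan, at the cost of a less precisely tuned split value.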
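Conditional Survival Forest models are constructed in a slightly different way from Random Survival Forest models:
1. The objective function is given by testing the null hypothesis that the response and the predictor are independent. To do so, for each predictor variable $x$, compute the logrank score test statistic and its associated p-value:
    - Consider $n$ observations $(T_1, \delta_1), \dots, (T_n, \delta_n)$. We assume the predictor $x$ has been ordered so that $x_1 \leq x_2 \leq \dots \leq x_n$. With $\gamma_j = \sum_{i=1}^{n} \mathbb{1}_{\{T_i \leq T_j\}}$, we compute the logrank scores $a_1, \dots, a_n$ such that:
    $$a_i = \delta_i - \sum_{j:\, T_j \leq T_i} \frac{\delta_j}{n - \gamma_j + 1}$$
    - For a predictor $x$ and split value $c$, and within the right node ($x_i \leq c$), we can now calculate:
        - the sum of all scores $S_{n,c} = \sum_{i:\, x_i \leq c} a_i$
        - its expectation $E[S_{n,c}] = n_1 \bar{a}$ with $\bar{a} = \frac{1}{n}\sum_{i=1}^{n} a_i$ and $n_1 = \sum_{i=1}^{n} \mathbb{1}_{\{x_i \leq c\}}$
        - its variance $Var[S_{n,c}] = \frac{n_1 (n - n_1)}{n} s_a^2$ with $s_a^2 = \frac{1}{n - 1}\sum_{i=1}^{n} (a_i - \bar{a})^2$
    - We can obtain the score test statistic $Q(x, c) = \frac{S_{n,c} - E[S_{n,c}]}{\sqrt{Var[S_{n,c}]}}$ and look for $c^*$ such that $|Q(x, c^*)| \geq |Q(x, c)|$ for every $c$.
    - Finally, we compute the p-value associated with $Q(x, c^*)$.
2. At each node, only the predictors whose associated p-value is smaller than a specified value $\alpha$ are considered, and among them the predictor with the smallest p-value is selected as the splitting candidate. If no predictor qualifies, no split is performed.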
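
The following sketch shows one way to compute logrank scores, the standardized score statistic for a given split value, and the p-value-based selection rule. All function names are hypothetical. The p-value computation itself is not shown: the p-values passed to `select_predictor` are assumed to come from elsewhere. Treat this as an illustration under those assumptions rather than PySurvival's implementation.

```python
import math


def logrank_scores(times, events):
    """Logrank score of each observation: its event indicator minus a
    Nelson-Aalen-type cumulative hazard accumulated up to its own time."""
    n = len(times)
    # gamma[j]: number of observations with time <= times[j]
    gamma = [sum(1 for s in times if s <= t) for t in times]
    scores = []
    for i in range(n):
        cum = sum(
            events[j] / (n - gamma[j] + 1)
            for j in range(n)
            if times[j] <= times[i]
        )
        scores.append(events[i] - cum)
    return scores


def score_statistic(x, scores, c):
    """Standardized sum of the scores of the observations with x <= c."""
    n = len(x)
    n1 = sum(1 for xi in x if xi <= c)
    s = sum(a for xi, a in zip(x, scores) if xi <= c)
    a_bar = sum(scores) / n
    ss = sum((a - a_bar) ** 2 for a in scores)
    var = n1 * (n - n1) / (n * (n - 1)) * ss
    return (s - n1 * a_bar) / math.sqrt(var) if var > 0 else 0.0


def select_predictor(p_values, alpha=0.05):
    """Return the predictor with the smallest p-value below alpha, else None.

    `p_values` is a hypothetical dict: predictor name -> p-value.
    """
    eligible = {name: p for name, p in p_values.items() if p < alpha}
    if not eligible:
        return None  # no predictor qualifies, so the node is not split
    return min(eligible, key=eligible.get)
```

Returning `None` when no p-value falls below the threshold mirrors the rule that no split is performed in that case.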
- Ishwaran H, Kogalur U, Blackstone E, Lauer M. Random survival forests. The Annals of Applied Statistics. 2008; 2(3):841–860.
- ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R
- Weathers, Brandon and Cutler, Richard Dr., "Comparison of Survival Curves Between Cox Proportional Hazards, Random Forests, and Conditional Inference Forests in Survival Analysis" (2017). All Graduate Plan B and other Reports. 927.
- Wright, Marvin N., Theresa Dankowski and Andreas Ziegler. "Random forests for survival analysis using maximally selected rank statistics." Statistics in Medicine 36(8) (2017): 1272-1284.
- Geurts, Pierre & Ernst, Damien & Wehenkel, Louis. (2006). Extremely Randomized Trees. Machine Learning. 63. 3-42. 10.1007/s10994-006-6226-1.