Gradient boosting for survival analysis uses ensemble methods that combine multiple weak learners (typically shallow trees) into a strong predictive model. This implementation supports traditional gradient boosting (gbm), component-wise boosting (mboost), and extreme gradient boosting (xgboost). The method is particularly effective for high-dimensional survival data and for complex non-linear relationships between predictors and survival outcomes. Features include automatic variable selection, handling of mixed-type predictors, built-in cross-validation for optimal stopping, and robust performance on noisy data. It is especially suitable for biomarker discovery, prognostic modeling, and complex survival prediction tasks.

Usage

gradientboosting(
  data,
  time,
  event,
  predictors,
  strata,
  algorithm = "mboost",
  n_trees = 100,
  learning_rate = 0.1,
  max_depth = 3,
  min_node_size = 10,
  subsample = 1,
  cv_folds = 5,
  early_stopping = TRUE,
  patience = 10,
  reg_alpha = 0,
  reg_lambda = 1,
  variable_selection = TRUE,
  importance_threshold = 0.01,
  show_convergence = TRUE,
  show_importance = TRUE,
  show_predictions = FALSE,
  plot_convergence = TRUE,
  plot_importance = TRUE,
  plot_partial = FALSE,
  plot_survival = FALSE,
  interaction_depth = 1,
  bag_fraction = 0.5,
  random_seed = 123
)

Arguments

data

The data as a data frame.

time

Time to event variable (numeric). For right-censored data, this is the time from study entry to event or censoring.

event

Event indicator variable. For survival analysis: 0 = censored, 1 = event. For competing risks: 0 = censored, 1+ = different event types.

predictors

Variables to use for boosting. Can include numeric, ordinal, and nominal variables. The algorithm automatically handles mixed-type predictors and performs variable selection.

strata

Optional stratification variable for stratified survival analysis. Creates separate baseline hazards for each stratum.

algorithm

Boosting algorithm to use. mboost provides component-wise boosting within a formal statistical framework; gbm offers traditional tree-based gradient boosting; xgboost provides extreme gradient boosting with advanced regularization (see reg_alpha and reg_lambda).
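
For example, switching the backend is a single-argument change. A minimal sketch, assuming a hypothetical mydata with the variable names shown:

fit_gbm <- gradientboosting(
    data = mydata,
    time = "time_to_event",
    event = "event_indicator",
    predictors = c("age", "stage"),
    algorithm = "gbm"
)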

n_trees

Number of boosting iterations (trees). More trees can improve performance but may lead to overfitting. Use cross-validation to determine the optimal value.

learning_rate

Learning rate (shrinkage parameter). Lower values require more trees but often provide better generalization. Typical values: 0.01-0.3.
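
As a rule of thumb, a lower learning rate is paired with more trees. A minimal sketch, assuming the same hypothetical mydata:

# Slow learning with many iterations; cross-validation picks the stopping point
fit_slow <- gradientboosting(
    data = mydata,
    time = "time_to_event",
    event = "event_indicator",
    predictors = c("age", "stage", "biomarker1"),
    learning_rate = 0.01,
    n_trees = 1000,
    cv_folds = 5
)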

max_depth

Maximum depth of individual trees. Shallow trees (1-6) are typically sufficient for boosting. Deeper trees may capture interactions but increase overfitting risk.

min_node_size

Minimum number of observations in terminal nodes. Higher values create simpler trees and reduce overfitting.

subsample

Fraction of observations used for each tree. Values < 1.0 introduce stochasticity and can improve generalization (stochastic gradient boosting).

cv_folds

Number of folds for cross-validation to determine optimal number of trees and prevent overfitting. Set to 0 to disable cross-validation.

early_stopping

Use early stopping based on cross-validation to prevent overfitting. Stops training when the validation error stops improving.

patience

Number of iterations without improvement before early stopping triggers. Higher values allow training to continue through temporary plateaus but increase the risk of overfitting.
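
A minimal early-stopping sketch, again with hypothetical data: training runs for at most n_trees iterations but halts once the cross-validated error has not improved for patience iterations.

fit_es <- gradientboosting(
    data = mydata,
    time = "time_to_event",
    event = "event_indicator",
    predictors = c("age", "stage"),
    n_trees = 500,
    cv_folds = 10,
    early_stopping = TRUE,
    patience = 20
)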

reg_alpha

L1 (Lasso) regularization parameter for XGBoost. Higher values increase sparsity by driving coefficients to zero.

reg_lambda

L2 (Ridge) regularization parameter for XGBoost. Higher values reduce overfitting by penalizing large coefficients.
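
A hedged sketch combining both penalties with the xgboost backend (the values are illustrative, not recommendations):

fit_reg <- gradientboosting(
    data = mydata,
    time = "time_to_event",
    event = "event_indicator",
    predictors = c("biomarker1", "biomarker2", "biomarker3"),
    algorithm = "xgboost",
    reg_alpha = 0.5,  # L1 penalty: encourages sparsity
    reg_lambda = 2    # L2 penalty: shrinks large coefficients
)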

variable_selection

Perform automatic variable selection during boosting. Variables with low importance are excluded from the final model.

importance_threshold

Minimum relative importance for a variable to be included in the final model. Variables below this threshold are excluded.
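
For high-dimensional data, a stricter threshold keeps only the strongest predictors. A minimal sketch with hypothetical gene-expression columns:

fit_sel <- gradientboosting(
    data = mydata,
    time = "time_to_event",
    event = "event_indicator",
    predictors = paste0("gene", 1:50),  # hypothetical predictor names
    variable_selection = TRUE,
    importance_threshold = 0.05
)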

show_convergence

Display convergence diagnostics including training and validation error curves, optimal stopping point, and convergence statistics.

show_importance

Calculate and display variable importance measures based on the frequency and improvement of splits.

show_predictions

Generate survival predictions and risk scores for the training data. Useful for model evaluation and risk stratification.

plot_convergence

Plot training and validation error curves showing convergence behavior and optimal stopping point.

plot_importance

Generate variable importance plot showing relative importance of predictors in the boosted model.

plot_partial

Create partial dependence plots for top variables showing marginal effect on survival hazard.

plot_survival

Plot Kaplan-Meier curves for risk groups defined by boosted model predictions with statistical comparisons.

interaction_depth

Maximum order of variable interactions to consider. Higher values capture complex interactions but increase computational complexity.

bag_fraction

Fraction of the training observations randomly subsampled to fit each tree (gbm's bag.fraction). Values below 1 introduce randomness and can improve generalization.

random_seed

Random seed for reproducible results. Change to get different random splits and variable selections.

Value

A results object containing:

results$todo: a html
results$modelSummary: a table
results$convergenceStats: a table
results$variableImportance: a table
results$predictions: a table
results$convergencePlot: an image
results$importancePlot: an image
results$partialPlots: an image
results$survivalPlot: an image

Tables can be converted to data frames with asDF or as.data.frame. For example:

results$modelSummary$asDF

as.data.frame(results$modelSummary)
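
For instance, the variable importance table from a fitted results object can be inspected as an ordinary data frame:

vimp <- as.data.frame(results$variableImportance)
head(vimp)  # column names depend on the table definition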

Examples

result <- gradientboosting(
    data = mydata,
    time = "time_to_event",
    event = "event_indicator",
    predictors = c("age", "stage", "biomarker1", "biomarker2"),
    algorithm = "mboost",
    n_trees = 100,
    learning_rate = 0.1,
    max_depth = 3,
    cv_folds = 5
)
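
A further sketch, assuming the same hypothetical mydata: a stratified analysis that also requests risk predictions and survival curves.

result2 <- gradientboosting(
    data = mydata,
    time = "time_to_event",
    event = "event_indicator",
    predictors = c("age", "biomarker1", "biomarker2"),
    strata = "center",  # hypothetical stratification variable
    show_predictions = TRUE,
    plot_survival = TRUE
)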