High-Dimensional Cox Regression - Comprehensive Guide
Source: vignettes/jsurvival-highdimcox-comprehensive.Rmd
Note: The highdimcox() function is designed for use within jamovi’s GUI. The code examples below show the R syntax for reference.
High-Dimensional Cox Regression
Overview
The High-Dimensional Cox Regression module
(highdimcox) provides a unified interface for regularized
Cox regression when the number of predictors is large relative to sample
size. It supports LASSO, Ridge,
Elastic Net, and Adaptive LASSO
regularization, with optional stability selection for robust variable
identification.
This module is designed for:
- Genomic survival studies (gene expression panels, mutation data)
- Proteomic/metabolomic data (high-dimensional biomarker panels)
- Radiomics studies (texture and morphology features from imaging)
- Any p >> n survival scenario
Key features:
- Multiple regularization methods in one interface
- Cross-validation with configurable lambda selection (min, 1se)
- Stability selection via bootstrap for robust variable identification
- Variable importance and coefficient visualization
- Model diagnostics and performance metrics
- Data suitability assessment with traffic-light checks
Regularization Methods Explained
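When you have more predictors than observations (or close to it), the standard Cox partial likelihood has no stable, unique maximum, so unpenalized estimates are unreliable or undefined. Regularization adds a penalty on coefficient size to stabilize the fit and limit overfitting.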
| Method | Penalty | Alpha | Selects Variables? | Best For |
|---|---|---|---|---|
| LASSO | L1 | 1.0 | Yes (sets to zero) | Sparse models |
| Ridge | L2 | 0.0 | No (shrinks all) | Prediction, collinearity |
| Elastic Net | Mixed L1+L2 | 0.5 | Yes (grouped) | Correlated groups |
| Adaptive LASSO | Weighted L1 | 1.0 | Yes (oracle) | Publication models |
The Elastic Net (default, alpha=0.5) is the recommended starting point — it combines the variable selection of LASSO with the stability of Ridge.
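The penalties in the table can be written as one function of the coefficient vector. The sketch below assumes the glmnet-style parameterization, lambda * (alpha * L1 + (1 - alpha) / 2 * L2); whether highdimcox() uses exactly this form internally is an assumption, and `elastic_net_penalty` is a hypothetical helper, not part of the package.

```r
# Hypothetical helper (not part of highdimcox): elastic net penalty,
# assuming the glmnet-style parameterization.
# weights = 1 gives LASSO/Ridge/Elastic Net; other weights give a weighted L1
# term, as in the adaptive LASSO.
elastic_net_penalty <- function(beta, lambda, alpha,
                                weights = rep(1, length(beta))) {
  l1 <- sum(weights * abs(beta))  # L1 part: drives coefficients to exactly zero
  l2 <- sum(beta^2)               # L2 part: shrinks without zeroing
  lambda * (alpha * l1 + (1 - alpha) / 2 * l2)
}

beta <- c(0.8, -0.6, 0, 0.5)  # illustrative coefficients

lasso_pen <- elastic_net_penalty(beta, lambda = 0.1, alpha = 1)
ridge_pen <- elastic_net_penalty(beta, lambda = 0.1, alpha = 0)
enet_pen  <- elastic_net_penalty(beta, lambda = 0.1, alpha = 0.5)

print(c(lasso = lasso_pen, ridge = ridge_pen, elastic_net = enet_pen))
```

With alpha = 1 only the L1 term remains, and with alpha = 0 only the L2 term, which is why the table lists those two values for LASSO and Ridge.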
Datasets Used in This Guide
| Dataset | N | Events | Predictors | Description |
|---|---|---|---|---|
| Synthetic “genomic” | 150 | 90 | 100 genes + 5 clinical | Gene expression survival study |
| Synthetic “proteomic” | 80 | 52 | 50 proteins + 3 clinical | Smaller proteomic study |
1. LASSO Regularization
Pure LASSO selects the sparsest model — ideal when you expect few true predictors.
# Simulate a genomic survival study (150 patients, 100 genes)
set.seed(42)
n <- 150
p_genes <- 100
# Gene expression matrix
gene_matrix <- matrix(rnorm(n * p_genes), nrow = n)
colnames(gene_matrix) <- paste0("GENE_", sprintf("%03d", 1:p_genes))
# True effects: 6 genes affect survival
true_effects <- rep(0, p_genes)
true_effects[c(5, 12, 27, 43, 68, 91)] <- c(0.8, -0.6, 0.5, -0.4, 0.7, -0.5)
# Clinical variables
age <- rnorm(n, 62, 10)
gender <- factor(sample(c("Male", "Female"), n, replace = TRUE))
stage <- factor(sample(c("I", "II", "III"), n, replace = TRUE,
prob = c(0.3, 0.4, 0.3)))
grade <- factor(sample(1:3, n, replace = TRUE))
treatment <- factor(sample(c("A", "B"), n, replace = TRUE))
# Survival times
lp <- gene_matrix %*% true_effects + 0.02 * (age - 62) + 0.3 * (as.numeric(stage) - 1)
surv_time <- rweibull(n, shape = 1.3, scale = 30 * exp(-as.numeric(lp) * 0.2))
censor_time <- runif(n, 6, 48)
time <- pmax(pmin(surv_time, censor_time), 0.1)
event <- factor(ifelse(surv_time <= censor_time, "Dead", "Alive"),
levels = c("Alive", "Dead"))
data_genomic <- data.frame(
survival_months = time,
vital_status = event,
age = age, gender = gender, stage = stage,
grade = grade, treatment = treatment,
as.data.frame(gene_matrix)
)
cat("N:", n, "Events:", sum(event == "Dead"),
"Event rate:", round(mean(event == "Dead"), 2), "\n")
#> N: 150 Events: 90 Event rate: 0.6
cat("Predictors:", ncol(data_genomic) - 2, "\n")
#> Predictors: 105
# All gene predictors (no clinical for clean LASSO demonstration)
gene_vars <- paste0("GENE_", sprintf("%03d", 1:p_genes))
highdimcox(
data = data_genomic,
elapsedtime = "survival_months",
outcome = "vital_status",
outcomeLevel = "Dead",
censorLevel = "Alive",
predictors = gene_vars,
regularization_method = "lasso",
cv_method = "cv_1se",
cv_folds = 10,
show_regularization_path = TRUE,
show_cv_plot = TRUE,
show_variable_importance = TRUE,
show_coefficients_table = TRUE
)
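#> Error:
#> ! Argument 'predictors' contains 'gene_vars' which is not present in the dataset
Note: as rendered, this call stops with the error above. The bare name gene_vars appears to be read as a column name instead of being evaluated as a character vector; the calls in sections 10 and 12, which pass an expression such as gene_vars[1:20] or paste0(...), run to completion. The same error recurs in the later calls that pass a stored vector by name.
2. Ridge Regularization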
Ridge keeps all variables but shrinks coefficients — better for prediction when you believe many variables contribute small effects.
highdimcox(
data = data_genomic,
elapsedtime = "survival_months",
outcome = "vital_status",
outcomeLevel = "Dead",
censorLevel = "Alive",
predictors = gene_vars,
regularization_method = "ridge",
cv_method = "cv_min",
cv_folds = 10,
show_cv_plot = TRUE,
show_coefficients_table = TRUE,
show_model_diagnostics = TRUE
)
#> Error:
#> ! Argument 'predictors' contains 'gene_vars' which is not present in the dataset
Note: Ridge regression never sets coefficients to exactly zero. All variables are retained. A table note explains this behavior.
3. Elastic Net (Default)
The recommended starting point — combines LASSO selection with Ridge stability.
pred_vars <- c("age", "gender", "stage",
paste0("GENE_", sprintf("%03d", 1:p_genes)))
highdimcox(
data = data_genomic,
elapsedtime = "survival_months",
outcome = "vital_status",
outcomeLevel = "Dead",
censorLevel = "Alive",
predictors = pred_vars,
regularization_method = "elastic_net",
alpha_value = 0.5,
cv_method = "cv_1se",
cv_folds = 10,
show_regularization_path = TRUE,
show_cv_plot = TRUE,
show_variable_importance = TRUE,
show_coefficients_table = TRUE
)
#> Error:
#> ! Argument 'predictors' contains 'pred_vars' which is not present in the dataset
Alpha Tuning
The alpha parameter controls the balance between L1 (LASSO) and L2 (Ridge). Higher alpha → more variable selection; lower alpha → more shrinkage.
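The effect of alpha can be seen by counting non-zero coefficients at one shared lambda. This is a minimal sketch that calls glmnet directly on simulated data; using glmnet as the backend, the simulated data, and lambda = 0.15 are all assumptions made for illustration, not the internals of highdimcox().

```r
library(glmnet)

# Small simulated survival data set: 2 true signals among 30 predictors.
set.seed(1)
n <- 120
p <- 30
x <- matrix(rnorm(n * p), nrow = n, dimnames = list(NULL, paste0("X", 1:p)))
lp <- 0.8 * x[, 1] - 0.6 * x[, 2]
event_time <- rexp(n, rate = 0.05 * exp(lp))
censor_time <- runif(n, 5, 60)
y <- cbind(time = pmin(event_time, censor_time),
           status = as.numeric(event_time <= censor_time))

# Count non-zero coefficients at a shared lambda for a given alpha.
# coef(fit, s = lambda) interpolates linearly along the fitted path.
count_nonzero <- function(alpha, lambda = 0.15) {
  fit <- glmnet(x, y, family = "cox", alpha = alpha)
  beta <- as.matrix(coef(fit, s = lambda))[, 1]
  sum(beta != 0)
}

n_selected <- sapply(c(ridge = 0, elastic_net = 0.5, lasso = 1), count_nonzero)
print(n_selected)
```

Ridge (alpha = 0) keeps every predictor, while larger alpha values leave fewer non-zero coefficients at the same lambda.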
# More LASSO-like (alpha = 0.8)
highdimcox(
data = data_genomic,
elapsedtime = "survival_months",
outcome = "vital_status",
outcomeLevel = "Dead",
censorLevel = "Alive",
predictors = gene_vars,
regularization_method = "elastic_net",
alpha_value = 0.8,
cv_folds = 10,
show_coefficients_table = TRUE,
show_cv_plot = TRUE
)
#> Error:
#> ! Argument 'predictors' contains 'gene_vars' which is not present in the dataset
4. Adaptive LASSO
Two-stage approach: first fits Ridge to get initial coefficient estimates, then uses inverse-coefficient weights to penalize unimportant variables more heavily. This gives oracle variable selection properties.
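The two stages described above can be sketched with glmnet, where `penalty.factor` carries the inverse-coefficient weights. The weight exponent of 1, the use of lambda.min for the Ridge stage, and the simulated data are assumptions for illustration; highdimcox() may make different choices.

```r
library(glmnet)

# Simulated data: 2 true signals among 30 predictors.
set.seed(2)
n <- 120
p <- 30
x <- matrix(rnorm(n * p), nrow = n, dimnames = list(NULL, paste0("X", 1:p)))
lp <- 0.8 * x[, 1] - 0.6 * x[, 2]
event_time <- rexp(n, rate = 0.05 * exp(lp))
censor_time <- runif(n, 5, 60)
y <- cbind(time = pmin(event_time, censor_time),
           status = as.numeric(event_time <= censor_time))

# Stage 1: Ridge fit gives initial (all non-zero) coefficient estimates.
ridge_cv <- cv.glmnet(x, y, family = "cox", alpha = 0, nfolds = 5)
beta_init <- as.matrix(coef(ridge_cv, s = "lambda.min"))[, 1]

# Stage 2: LASSO with weights 1 / |beta_init|, so predictors with small
# initial estimates are penalized more heavily.
w <- 1 / pmax(abs(beta_init), 1e-6)  # guard against division by zero
adaptive_cv <- cv.glmnet(x, y, family = "cox", alpha = 1,
                         penalty.factor = w, nfolds = 5)
beta_adaptive <- as.matrix(coef(adaptive_cv, s = "lambda.1se"))[, 1]

print(names(beta_adaptive)[beta_adaptive != 0])
```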
highdimcox(
data = data_genomic,
elapsedtime = "survival_months",
outcome = "vital_status",
outcomeLevel = "Dead",
censorLevel = "Alive",
predictors = gene_vars,
regularization_method = "adaptive_lasso",
cv_method = "cv_1se",
cv_folds = 10,
show_regularization_path = TRUE,
show_cv_plot = TRUE,
show_variable_importance = TRUE,
show_coefficients_table = TRUE
)
#> Error:
#> ! Argument 'predictors' contains 'gene_vars' which is not present in the dataset
5. Stability Selection
Bootstrap-based stability selection identifies variables that are consistently selected across many random subsamples. This follows Meinshausen & Bühlmann (2010), with a fixed lambda across all subsamples for proper error control.
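The procedure can be sketched as a resampling loop around a penalized fit. The version below is an illustration under stated assumptions: it uses glmnet directly, draws subsamples without replacement, fixes lambda at the full-data lambda.min, and runs only 50 iterations on simulated data. None of these choices is confirmed to match highdimcox().

```r
library(glmnet)

# Simulated data: 2 true signals among 30 predictors.
set.seed(3)
n <- 120
p <- 30
x <- matrix(rnorm(n * p), nrow = n, dimnames = list(NULL, paste0("X", 1:p)))
lp <- 0.8 * x[, 1] - 0.6 * x[, 2]
event_time <- rexp(n, rate = 0.05 * exp(lp))
censor_time <- runif(n, 5, 60)
y <- cbind(time = pmin(event_time, censor_time),
           status = as.numeric(event_time <= censor_time))

# Choose lambda once on the full data, then keep it fixed in every subsample.
cv_fit <- cv.glmnet(x, y, family = "cox", alpha = 0.5, nfolds = 5)
lambda_fixed <- cv_fit$lambda.min

n_iter <- 50   # illustrative; the text uses 100 to 200
ratio <- 0.5   # fraction of rows drawn per subsample

# Each column records which predictors were non-zero in one subsample.
selected <- replicate(n_iter, {
  idx <- sample(n, size = floor(ratio * n))
  fit <- glmnet(x[idx, ], y[idx, ], family = "cox", alpha = 0.5)
  as.matrix(coef(fit, s = lambda_fixed))[, 1] != 0
})

selection_freq <- rowMeans(selected)
stable_vars <- names(selection_freq)[selection_freq >= 0.8]
print(sort(selection_freq, decreasing = TRUE)[1:5])
```

Here `selection_freq` plays the role of the selection probability, and 0.8 corresponds to stability_threshold.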
highdimcox(
data = data_genomic,
elapsedtime = "survival_months",
outcome = "vital_status",
outcomeLevel = "Dead",
censorLevel = "Alive",
predictors = gene_vars,
regularization_method = "elastic_net",
alpha_value = 0.5,
cv_folds = 10,
stability_selection = TRUE,
subsampling_iterations = 100,
subsampling_ratio = 0.5,
stability_threshold = 0.8,
show_variable_importance = TRUE,
show_coefficients_table = TRUE
)
#> Error:
#> ! Argument 'predictors' contains 'gene_vars' which is not present in the dataset
Variables selected in >80% of bootstrap iterations are considered robustly important.
Custom Stability Parameters
# Lower threshold and different subsampling ratio
highdimcox(
data = data_genomic,
elapsedtime = "survival_months",
outcome = "vital_status",
outcomeLevel = "Dead",
censorLevel = "Alive",
predictors = gene_vars,
regularization_method = "lasso",
cv_folds = 10,
stability_selection = TRUE,
subsampling_iterations = 200,
subsampling_ratio = 0.6,
stability_threshold = 0.6,
show_coefficients_table = TRUE
)
#> Error:
#> ! Argument 'predictors' contains 'gene_vars' which is not present in the dataset
6. CV Method Comparison
The choice of CV lambda selection affects model complexity.
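The two rules differ only in how they read the cross-validation curve. The sketch below applies both to made-up CV summaries; `pick_lambda` is a hypothetical helper and the numbers are illustrative, not output from highdimcox().

```r
# Hypothetical helper: apply the minimum-error and 1-SE rules to CV summaries.
pick_lambda <- function(lambda, cv_mean, cv_se) {
  i_min <- which.min(cv_mean)
  # 1-SE rule: largest lambda whose CV error is within one standard error
  # of the minimum CV error.
  threshold <- cv_mean[i_min] + cv_se[i_min]
  list(
    lambda_min = lambda[i_min],
    lambda_1se = max(lambda[cv_mean <= threshold])
  )
}

# Illustrative CV curve (partial-likelihood deviance per lambda).
lambda  <- c(0.50, 0.40, 0.30, 0.20, 0.10, 0.05)
cv_mean <- c(6.40, 6.36, 6.33, 6.30, 6.31, 6.35)
cv_se   <- c(0.05, 0.05, 0.05, 0.04, 0.04, 0.05)

res <- pick_lambda(lambda, cv_mean, cv_se)
print(res)
```

Because lambda_1se is never smaller than lambda_min, the 1-SE model is at least as heavily penalized and therefore tends to retain fewer variables.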
# Minimum CV error — more variables, better fit but risk of overfitting
highdimcox(
data = data_genomic,
elapsedtime = "survival_months",
outcome = "vital_status",
outcomeLevel = "Dead",
censorLevel = "Alive",
predictors = gene_vars,
regularization_method = "lasso",
cv_method = "cv_min",
cv_folds = 10,
show_coefficients_table = TRUE,
show_cv_plot = TRUE
)
#> Error:
#> ! Argument 'predictors' contains 'gene_vars' which is not present in the dataset
# 1-SE rule — more parsimonious, better generalization
highdimcox(
data = data_genomic,
elapsedtime = "survival_months",
outcome = "vital_status",
outcomeLevel = "Dead",
censorLevel = "Alive",
predictors = gene_vars,
regularization_method = "lasso",
cv_method = "cv_1se",
cv_folds = 10,
show_coefficients_table = TRUE,
show_cv_plot = TRUE
)
#> Error:
#> ! Argument 'predictors' contains 'gene_vars' which is not present in the dataset
7. Data Suitability Assessment
The suitability check runs 6 diagnostics with traffic-light indicators:
- Events-Per-Variable (EPV) — green ≥10, yellow ≥1, red <1
- Regularization Need — green if p ≥ n/3
- Sample Size — green ≥100, yellow ≥30, red <30
- Event Rate — green 20–80%, yellow otherwise
- Multicollinearity — max pairwise |r| (skipped if p > 2000)
- Data Quality — missing data and constant predictors
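Three of these checks reduce to simple arithmetic on n, the number of events, and the number of predictors. The sketch below encodes only the thresholds listed above; `traffic_light` is a hypothetical helper and does not reproduce the module's actual implementation.

```r
# Hypothetical helper: EPV, sample size, and event rate checks,
# using the thresholds listed above.
traffic_light <- function(n, n_events, p) {
  epv <- n_events / p
  event_rate <- n_events / n
  c(
    epv = if (epv >= 10) "green" else if (epv >= 1) "yellow" else "red",
    sample_size = if (n >= 100) "green" else if (n >= 30) "yellow" else "red",
    event_rate = if (event_rate >= 0.2 && event_rate <= 0.8) "green" else "yellow"
  )
}

# 150 patients, 90 events, 20 predictors: EPV = 90 / 20 = 4.5
checks_large <- traffic_light(n = 150, n_events = 90, p = 20)
# 35 patients, 25 events, 10 predictors: EPV = 25 / 10 = 2.5
checks_small <- traffic_light(n = 35, n_events = 25, p = 10)

print(rbind(checks_large, checks_small))
```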
highdimcox(
data = data_genomic,
elapsedtime = "survival_months",
outcome = "vital_status",
outcomeLevel = "Dead",
censorLevel = "Alive",
predictors = gene_vars,
suitabilityCheck = TRUE,
show_coefficients_table = TRUE
)
#> Error:
#> ! Argument 'predictors' contains 'gene_vars' which is not present in the dataset
# Disable suitability assessment
highdimcox(
data = data_genomic,
elapsedtime = "survival_months",
outcome = "vital_status",
outcomeLevel = "Dead",
censorLevel = "Alive",
predictors = gene_vars,
suitabilityCheck = FALSE,
show_coefficients_table = TRUE
)
#> Error:
#> ! Argument 'predictors' contains 'gene_vars' which is not present in the dataset
8. Smaller Dataset (Proteomic)
# 80 patients, 50 protein markers
set.seed(123)
n2 <- 80
p_prot <- 50
prot_matrix <- matrix(rnorm(n2 * p_prot, mean = 10, sd = 3), nrow = n2)
colnames(prot_matrix) <- paste0("PROT_", sprintf("%02d", 1:p_prot))
# 4 proteins with true effects
true_prot <- rep(0, p_prot)
true_prot[c(3, 17, 28, 42)] <- c(0.15, -0.12, 0.18, -0.10)
lp2 <- prot_matrix %*% true_prot
surv_time2 <- rweibull(n2, shape = 1.1, scale = 24 * exp(-as.numeric(lp2) * 0.5))
censor_time2 <- runif(n2, 6, 36)
time2 <- pmax(pmin(surv_time2, censor_time2), 0.1)
event2 <- factor(ifelse(surv_time2 <= censor_time2, "Dead", "Alive"),
levels = c("Alive", "Dead"))
data_proteomic <- data.frame(
time = time2, status = event2,
age = rnorm(n2, 55, 12),
sex = factor(sample(c("M", "F"), n2, replace = TRUE)),
tumor_size_cm = rnorm(n2, 3.5, 1.2),
as.data.frame(prot_matrix)
)
cat("N:", n2, "Events:", sum(event2 == "Dead"),
"Event rate:", round(mean(event2 == "Dead"), 2), "\n")
#> N: 80 Events: 52 Event rate: 0.65
prot_vars <- c("age", "sex", "tumor_size_cm",
paste0("PROT_", sprintf("%02d", 1:p_prot)))
highdimcox(
data = data_proteomic,
elapsedtime = "time",
outcome = "status",
outcomeLevel = "Dead",
censorLevel = "Alive",
predictors = prot_vars,
regularization_method = "elastic_net",
alpha_value = 0.5,
cv_method = "cv_1se",
cv_folds = 5,
show_regularization_path = TRUE,
show_cv_plot = TRUE,
show_variable_importance = TRUE,
show_coefficients_table = TRUE,
showExplanations = TRUE
)
#> Error:
#> ! Argument 'predictors' contains 'prot_vars' which is not present in the dataset
9. Mixed Predictors (Clinical + Genomic)
Demonstrates automatic factor dummy-encoding for mixed predictor types.
mixed_vars <- c("age", "gender", "stage", "grade", "treatment",
paste0("GENE_", sprintf("%03d", 1:20)))
highdimcox(
data = data_genomic,
elapsedtime = "survival_months",
outcome = "vital_status",
outcomeLevel = "Dead",
censorLevel = "Alive",
predictors = mixed_vars,
regularization_method = "elastic_net",
alpha_value = 0.5,
cv_folds = 10,
show_coefficients_table = TRUE,
show_variable_importance = TRUE
)
#> Error:
#> ! Argument 'predictors' contains 'mixed_vars' which is not present in the dataset
Factor variables like stage and gender are automatically converted to dummy variables (e.g., “stage: II”, “stage: III”) in the output.
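Dummy encoding of this kind can be reproduced in base R with model.matrix(). This is a sketch of the general technique on toy data; whether highdimcox() uses model.matrix() internally is an assumption, and its output labels ("stage: II") differ from the column names shown here.

```r
# Toy data with one numeric and two factor predictors.
df <- data.frame(
  age = c(55, 61, 70, 48),
  gender = factor(c("Female", "Male", "Male", "Female")),
  stage = factor(c("I", "II", "III", "II"))
)

# Treatment (dummy) coding: the first level of each factor is the reference.
# The intercept column is dropped because the Cox model has no intercept.
x <- model.matrix(~ ., data = df)[, -1, drop = FALSE]

print(colnames(x))
```

A factor with k levels contributes k - 1 columns, so stage (three levels) becomes two dummy variables and each one receives its own penalized coefficient.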
10. Explanatory Output
highdimcox(
data = data_genomic,
elapsedtime = "survival_months",
outcome = "vital_status",
outcomeLevel = "Dead",
censorLevel = "Alive",
predictors = gene_vars[1:20],
regularization_method = "elastic_net",
cv_folds = 10,
show_coefficients_table = TRUE,
showSummaries = TRUE,
showExplanations = TRUE
)
#>
#> HIGH-DIMENSIONAL COX REGRESSION
#>
#> Analysis completed: 150 observations, 90 events, 20 predictors, 0
#> selected via elastic_net (C-index=0.500).
#>
#> Overall: Data is usable but review the flagged items.
#>
#> ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
#> Status    Check                           Value                     Detail
#> ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
#> ● yellow  Events-Per-Variable (Overall)   4.5 (n_events=90, p=20)   Adequate for regularized regression, which handles low EPV better than standard Cox.
#> ● yellow  Regularization Need             p=20, EPV=4               Moderate/low dimensionality. Standard Cox may also suffice.
#> ● green   Sample Size                     n=150                     Adequate sample size for penalized regression.
#> ● green   Event Rate                      60.0% (90/150)            Balanced event rate. Good for model estimation.
#> ● green   Multicollinearity               Max |r| = 0.26            No concerning collinearity detected.
#> ● green   Data Quality                    No issues                 Complete data with no constant predictors.
#> ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
#>
#> High-Dimensional Cox Regression Results
#>
#> Regularization: elastic_net (α = 0.5)
#>
#> Selected Lambda: 0.413130722301642
#>
#> Variables: 20 candidate variables → 0 selected
#>
#> Cross-Validation: 10-fold CV
#>
#> Training C-index (optimistic): 0.5
#>
#> Selected Variables
#> ─────────────────────────────────────
#> Variable β HR Importance
#> ─────────────────────────────────────
#> ─────────────────────────────────────
#> Note. No variables were
#> selected at the chosen
#> regularization level. Consider
#> using a less restrictive lambda
#> (minimum CV) or a different
#> regularization method.
#>
#>
#> Regularization Metrics
#> ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
#> Metric Value Interpretation
#> ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
#> Selected Lambda 0.413130722301642 Optimal regularization strength
#> Lambda Min 0.162947269965384 Lambda minimizing CV error
#> Lambda 1SE 0.413130722301642 Lambda within 1-SE of minimum
#> CV Deviance at Selected Lambda 6.336 Cross-validated model deviance
#> Training C-index (optimistic) 0.5 Training-set estimate; likely overestimates true discrimination. Validate externally.
#> Number of Selected Variables 0 Variables with non-zero coefficients
#> Regularization Method elastic_net (α=0.5) Applied regularization strategy
#> ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
#>
#>
#> Analysis Summary
#>
#> High-dimensional Cox regression analysis was performed on 20 predictor
#> variables using elastic_net regularization.
#>
#> Model Selection: Cross-validation with 10 folds identified an optimal
#> regularization parameter (λ = 0.413130722301642) that selected
#> 0 variables from the candidate set.
#>
#> Training C-index (optimistic): 0.5
#>
#> Interpretation: The selected variables represent the most predictive
#> features for survival outcome after accounting for multiple testing
#> and overfitting through regularization. Variables with larger absolute
#> coefficients have stronger associations with survival risk.
#>
#> High-Dimensional Cox Regression Methodology
#>
#> Overview
#>
#> High-dimensional Cox regression extends traditional Cox proportional
#> hazards modeling to handle datasets where the number of predictors (p)
#> may exceed or approach the number of observations (n). This scenario
#> is common in genomic, proteomic, and other high-throughput biomedical
#> research contexts.
#>
#> Regularization Methods
#>
#> - LASSO (L1): Performs automatic variable selection by shrinking some
#>   coefficients to exactly zero
#> - Ridge (L2): Shrinks coefficients toward zero but retains all
#>   variables, useful when predictors are correlated
#> - Elastic Net: Combines L1 and L2 penalties, balancing variable
#>   selection and coefficient shrinkage
#> - Adaptive LASSO: Uses data-driven penalty weights for improved
#>   variable selection properties
#>
#> Cross-Validation
#>
#> The regularization parameter (λ) is selected using cross-validation to
#> optimize prediction performance. The '1-SE rule' selects a more
#> parsimonious model by choosing the largest λ within one standard error
#> of the minimum cross-validation error.
#>
#> Stability Selection
#>
#> When enabled, stability selection performs variable selection across
#> multiple bootstrap samples to identify variables that are consistently
#> selected. This provides a measure of selection confidence and helps
#> identify the most robust predictive features.
#>
#> Clinical Interpretation
#>
#> Selected variables and their coefficients can be used to:
#>
#> - Identify key biomarkers associated with survival
#> - Develop prognostic signatures for risk stratification
#> - Guide hypothesis generation for follow-up studies
#> - Build personalized survival prediction models
11. Full-Feature Demonstration
All options enabled simultaneously.
highdimcox(
data = data_proteomic,
elapsedtime = "time",
outcome = "status",
outcomeLevel = "Dead",
censorLevel = "Alive",
predictors = prot_vars,
regularization_method = "elastic_net",
alpha_value = 0.7,
cv_method = "cv_min",
cv_folds = 5,
stability_selection = TRUE,
subsampling_iterations = 100,
subsampling_ratio = 0.5,
stability_threshold = 0.7,
suitabilityCheck = TRUE,
show_regularization_path = TRUE,
show_cv_plot = TRUE,
show_variable_importance = TRUE,
show_coefficients_table = TRUE,
show_model_diagnostics = TRUE,
showSummaries = TRUE,
showExplanations = TRUE
)
#> Error:
#> ! Argument 'predictors' contains 'prot_vars' which is not present in the dataset
12. Edge Cases
Small sample (near minimum)
small_data <- data_proteomic[1:35, ]
highdimcox(
data = small_data,
elapsedtime = "time",
outcome = "status",
outcomeLevel = "Dead",
censorLevel = "Alive",
predictors = paste0("PROT_", sprintf("%02d", 1:10)),
regularization_method = "elastic_net",
cv_folds = 5,
show_coefficients_table = TRUE,
suitabilityCheck = TRUE
)
#>
#> HIGH-DIMENSIONAL COX REGRESSION
#>
#> Analysis completed: 35 observations, 25 events, 10 predictors, 0
#> selected via elastic_net (C-index=0.500).
#>
#> Overall: Data is usable but review the flagged items.
#>
#> ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
#> Status    Check                           Value                     Detail
#> ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
#> ● yellow  Events-Per-Variable (Overall)   2.5 (n_events=25, p=10)   Adequate for regularized regression, which handles low EPV better than standard Cox.
#> ● yellow  Regularization Need             p=10, EPV=2               Moderate/low dimensionality. Standard Cox may also suffice.
#> ● yellow  Sample Size                     n=35                      Small sample. CV folds may be somewhat unstable.
#> ● green   Event Rate                      71.4% (25/35)             Balanced event rate. Good for model estimation.
#> ● green   Multicollinearity               Max |r| = 0.39            No concerning collinearity detected.
#> ● green   Data Quality                    No issues                 Complete data with no constant predictors.
#> ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
#>
#> High-Dimensional Cox Regression Results
#>
#> Regularization: elastic_net (α = 0.5)
#>
#> Selected Lambda: 0.50541683726211
#>
#> Variables: 10 candidate variables → 0 selected
#>
#> Cross-Validation: 5-fold CV
#>
#> Training C-index (optimistic): 0.5
#>
#> Selected Variables
#> ─────────────────────────────────────
#> Variable β HR Importance
#> ─────────────────────────────────────
#> ─────────────────────────────────────
#> Note. No variables were
#> selected at the chosen
#> regularization level. Consider
#> using a less restrictive lambda
#> (minimum CV) or a different
#> regularization method.
#>
#>
#> Regularization Metrics
#> ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
#> Metric Value Interpretation
#> ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
#> Selected Lambda 0.50541683726211 Optimal regularization strength
#> Lambda Min 0.50541683726211 Lambda minimizing CV error
#> Lambda 1SE 0.50541683726211 Lambda within 1-SE of minimum
#> CV Deviance at Selected Lambda 5.16 Cross-validated model deviance
#> Training C-index (optimistic) 0.5 Training-set estimate; likely overestimates true discrimination. Validate externally.
#> Number of Selected Variables 0 Variables with non-zero coefficients
#> Regularization Method elastic_net (α=0.5) Applied regularization strategy
#> ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Below minimum observations
tiny_data <- data_proteomic[1:20, ]
highdimcox(
data = tiny_data,
elapsedtime = "time",
outcome = "status",
outcomeLevel = "Dead",
censorLevel = "Alive",
predictors = paste0("PROT_", sprintf("%02d", 1:5)),
show_coefficients_table = TRUE
)
#>
#> HIGH-DIMENSIONAL COX REGRESSION
#>
#> Validation Error
#>
#> At least 30 observations required for high-dimensional analysis (found
#> 20).
#>
#> character(0)
#>
#> character(0)
#>
#> Selected Variables
#> ─────────────────────────────────────
#> Variable β HR Importance
#> ─────────────────────────────────────
#> ─────────────────────────────────────
#>
#>
#> Regularization Metrics
#> ─────────────────────────────────────
#> Metric Value Interpretation
#> ─────────────────────────────────────
#> ─────────────────────────────────────
All censored (no events)
no_events <- data_genomic
no_events$vital_status <- factor("Alive", levels = c("Alive", "Dead"))
highdimcox(
data = no_events,
elapsedtime = "survival_months",
outcome = "vital_status",
outcomeLevel = "Dead",
censorLevel = "Alive",
predictors = gene_vars[1:20],
show_coefficients_table = TRUE
)
#>
#> HIGH-DIMENSIONAL COX REGRESSION
#>
#> Validation Error
#>
#> No rows match event level 'Dead' in the outcome variable.
#>
#> character(0)
#>
#> character(0)
#>
#> Selected Variables
#> ─────────────────────────────────────
#> Variable β HR Importance
#> ─────────────────────────────────────
#> ─────────────────────────────────────
#>
#>
#> Regularization Metrics
#> ─────────────────────────────────────
#> Metric Value Interpretation
#> ─────────────────────────────────────
#> ─────────────────────────────────────
Interpreting Results
Coefficient Table
| Column | Meaning |
|---|---|
| Variable | Predictor name (or “Variable: Level” for factor dummies) |
| Coefficient (β) | Regularized log hazard ratio |
| Hazard Ratio (HR) | exp(coefficient) — HR > 1 = increased risk |
| Importance | Absolute coefficient value |
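The columns are linked by simple transformations of the coefficient. The sketch below builds the same columns from made-up coefficients; the variable names and values are hypothetical and only illustrate the arithmetic.

```r
# Hypothetical regularized coefficients (log hazard ratios).
beta <- c(GENE_A = 0.25, GENE_B = -0.40, GENE_C = 0)

coef_table <- data.frame(
  variable = names(beta),
  coefficient = unname(beta),
  hazard_ratio = exp(unname(beta)),  # HR > 1: higher risk; HR < 1: lower risk
  importance = abs(unname(beta))     # magnitude only, ignoring direction
)

# Rank by importance, as in the variable importance plot.
coef_table <- coef_table[order(-coef_table$importance), ]
print(coef_table)
```

A coefficient of exactly zero gives HR = 1 and importance 0, which corresponds to a variable that was not selected.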
Variable Importance Plot
Variables ranked by absolute regularized coefficient. Top 25 shown. Selected variables (non-zero coefficients) are highlighted.
Regularization Path
Shows how each variable’s coefficient changes as lambda increases. Variables that persist at higher lambda values are more robust predictors.
Method Selection Guide
| Your Data | Recommended Method | Alpha |
|---|---|---|
| p >> n, expect few true predictors | LASSO | 1.0 |
| p >> n, many small effects | Ridge | 0.0 |
| p >> n, correlated groups | Elastic Net | 0.3-0.7 |
| p > n, want publishable model | Adaptive LASSO | 1.0 |
| Moderate p, clinical study | Elastic Net | 0.5 |
| Prediction is primary goal | Ridge or Elastic Net | 0.0-0.5 |
| Interpretation is primary goal | LASSO or Adaptive LASSO | 0.8-1.0 |
Common Pitfalls
Using Ridge when you need variable selection: Ridge never sets coefficients to exactly zero. For interpretable models, use LASSO or Elastic Net.
Not adjusting CV folds for small samples: With n < 100, use cv_folds = 5 instead of 10 to ensure adequate events per fold.
Ignoring stability selection: A single regularized model may give unstable variable selections. Use stability selection (subsampling_iterations >= 200) for robust results.
Comparing models with different alpha values: Models with different alpha values optimize different objectives. Compare using the same alpha or use nested cross-validation.
Not reporting the regularization method: Always report which method (LASSO/Ridge/Elastic Net/Adaptive LASSO), alpha value, CV method, and number of selected variables.
Training C-index overestimates performance: The reported C-index is computed on the training data and is optimistically biased. Always validate on external data before clinical use.
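The gap between training and held-out discrimination can be checked with a simple split. This is a minimal sketch on simulated data that uses glmnet and survival::concordance() directly; it is an assumption-laden illustration, not a feature of highdimcox(), and an internal split does not replace external validation.

```r
library(glmnet)
library(survival)

# Simulated data: 2 true signals among 30 predictors.
set.seed(4)
n <- 200
p <- 30
x <- matrix(rnorm(n * p), nrow = n, dimnames = list(NULL, paste0("X", 1:p)))
lp <- 0.8 * x[, 1] - 0.6 * x[, 2]
event_time <- rexp(n, rate = 0.05 * exp(lp))
censor_time <- runif(n, 5, 60)
y <- cbind(time = pmin(event_time, censor_time),
           status = as.numeric(event_time <= censor_time))

# Hold out 30% of the rows; fit only on the training rows.
train <- sample(n, size = 140)
test <- setdiff(seq_len(n), train)
cv_fit <- cv.glmnet(x[train, ], y[train, ], family = "cox",
                    alpha = 0.5, nfolds = 5)

# C-index of the fitted linear predictor on a given set of rows.
# reverse = TRUE because a higher risk score should mean shorter survival.
c_index <- function(rows) {
  risk <- as.numeric(predict(cv_fit, newx = x[rows, ], s = "lambda.min"))
  time <- y[rows, "time"]
  status <- y[rows, "status"]
  concordance(Surv(time, status) ~ risk, reverse = TRUE)$concordance
}

c_train <- c_index(train)
c_test <- c_index(test)
print(c(training = c_train, held_out = c_test))
```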
Related ClinicoPath Functions
| Function | Use When |
|---|---|
| LASSO Cox (lassocox) | Pure LASSO with suitability assessment and clinical guidance |
| Adaptive LASSO (adaptivelasso) | Dedicated adaptive LASSO with more stability options |
| SCAD Cox (ncvregcox) | Non-convex penalties (SCAD, MCP) avoiding LASSO bias |
| PLS Cox (plscox) | Dimensionality reduction approach for very high p |
| Multivariable Survival (survival) | Standard Cox when p is small |
References
- Zou H, Hastie T. Regularization and variable selection via the elastic net. J R Stat Soc Ser B. 2005;67(2):301-320.
- Simon N, et al. Regularization paths for Cox’s proportional hazards model via coordinate descent. J Stat Softw. 2011;39(5):1-13.
- Meinshausen N, Bühlmann P. Stability selection. J R Stat Soc Ser B. 2010;72(4):417-473.
- Tibshirani R. Regression shrinkage and selection via the lasso. J R Stat Soc Ser B. 1996;58(1):267-288.
- Zou H. The adaptive lasso and its oracle properties. J Am Stat Assoc. 2006;101(476):1418-1429.