LASSO Cox Regression for Variable Selection in Survival Analysis
Note: The lassocox() function is designed for use within jamovi’s GUI. The code examples below show the R syntax for reference. To run interactively, use devtools::load_all() and call the R6 class directly: lassocoxClass$new(options = lassocoxOptions$new(...), data = mydata).
LASSO Cox Regression
Overview
The Lasso-Cox Regression module
(lassocox) performs automatic variable selection in
survival analysis using L1-penalized Cox proportional hazards
regression. It identifies the most important predictors while preventing
overfitting — ideal for clinicopathological research with many candidate
variables.
Key features:
- Data suitability assessment (traffic-light system)
- Cross-validated lambda selection (lambda.min vs lambda.1se)
- Variable selection with coefficient shrinkage
- Risk score calculation and patient stratification
- Comprehensive visualizations (CV plot, coefficient plot, survival curves)
- Clinical interpretation guidance
What is LASSO Cox Regression?
LASSO (Least Absolute Shrinkage and Selection Operator) Cox regression combines the Cox proportional hazards model with L1 regularization for automatic variable selection.
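Standard Cox model:
$$h(t \mid X) = h_0(t) \exp(\beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p)$$
LASSO Cox adds a penalty:
$$\hat{\beta} = \arg\max_{\beta} \left[ \ell(\beta) - \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert \right]$$
where $\ell(\beta)$ is the partial log-likelihood and $\lambda$ controls penalty strength.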
The effect: small coefficients shrink toward zero, irrelevant variables become exactly zero, and only important predictors remain in the model.
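The shrink-to-zero behaviour can be seen in the simplest possible setting. For a single standardized coefficient, the L1 penalty reduces to the soft-thresholding operator, and coordinate-descent solvers apply the same operator inside each update. The sketch below is a simplified illustration, not the module's code: the function name `soft_threshold` and the input values are made up for the example.

```r
# Soft-thresholding: the closed-form lasso update for one standardized
# coefficient. z is the unpenalized estimate, lambda the penalty strength.
soft_threshold <- function(z, lambda) {
  sign(z) * pmax(abs(z) - lambda, 0)
}

# Illustrative unpenalized estimates (made-up values, not from any dataset)
z <- c(var1 = 0.80, var2 = -0.05, var3 = 0.30, var4 = 0.02)

shrunk <- soft_threshold(z, lambda = 0.10)
print(shrunk)

# Predictors that survive the penalty (non-zero after shrinkage)
selected <- names(shrunk)[shrunk != 0]
print(selected)
```

Estimates smaller in magnitude than lambda (var2, var4) become exactly zero, while larger ones (var1, var3) are pulled toward zero by lambda.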
LASSO vs Ridge vs Elastic Net
| Method | Penalty | Variable Selection | Best For |
|---|---|---|---|
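| Ridge | $\lambda \sum_j \beta_j^2$ (L2) | No (shrinks but keeps all) | Correlated predictors, all must stay |
| LASSO | $\lambda \sum_j \lvert \beta_j \rvert$ (L1) | Yes (sets some to zero) | Sparse models, feature selection |
| Elastic Net | $\lambda \left[ \alpha \sum_j \lvert \beta_j \rvert + (1 - \alpha) \sum_j \beta_j^2 \right]$ (L1 + L2) | Yes (grouped selection) | Correlated groups should stay/go together |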
In ClinicoPath: lassocox uses alpha=1 (pure LASSO). For
Elastic Net or Ridge, use the Penalized Cox Regression
module.
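To make the role of alpha concrete, the following sketch evaluates the three penalties on one coefficient vector. It is an illustration only: the helper name `penalty` and the coefficient values are invented, and some implementations scale the ridge term by 1/2, which is left out here for readability.

```r
# Elastic-net style penalty: alpha = 1 is pure LASSO, alpha = 0 is pure ridge.
# Hypothetical helper for illustration; not part of any package.
penalty <- function(beta, lambda, alpha) {
  lambda * (alpha * sum(abs(beta)) + (1 - alpha) * sum(beta^2))
}

beta <- c(0.5, -0.25, 0)   # illustrative coefficients (made-up values)

p_lasso <- penalty(beta, lambda = 1, alpha = 1)    # L1 only: 0.75
p_ridge <- penalty(beta, lambda = 1, alpha = 0)    # L2 only: 0.3125
p_enet  <- penalty(beta, lambda = 1, alpha = 0.5)  # equal mix: 0.53125

c(lasso = p_lasso, ridge = p_ridge, elastic_net = p_enet)
```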
When to Use LASSO Cox Regression
| Scenario | Recommendation |
|---|---|
| Many predictors relative to sample (p >= n/3) | LASSO strongly indicated |
| Radiomics / genomics feature selection | LASSO ideal |
| Building a prognostic risk score | LASSO recommended |
| Few predictors (p < 5) with ample events | Standard Cox may suffice |
| Correlated predictors that should all be retained | Consider Elastic Net |
| Inference / p-values are the primary goal | Use standard Cox regression |
Datasets Used in This Guide
| Dataset | N | Events | Predictors | Description |
|---|---|---|---|---|
| Synthetic “standard” | 200 | 80 | 3 numeric | Clean scenario (all green suitability) |
| Synthetic “small” | 60 | 25 | 5 numeric | Yellow suitability scenario |
| Synthetic “collinear” | 100 | 45 | 4 (2 correlated) | Multicollinearity detection |
| Synthetic “categorical” | 100 | 40 | 3 factors | All categorical predictors |
| histopathology | 250 | ~100 | 8 mixed | Real clinical pathology data |
| lassocox_genomic | 80 | – | 50 gene features | High-dimensional genomic scenario |
| lassocox_multicollinear | 180 | – | 12 correlated | Multicollinearity stress test |
| lassocox_breast_cancer | 250 | – | 20 clinical | Breast cancer clinicopathological |
1. Basic Analysis (Green Suitability)
A standard scenario with adequate sample size, events, and predictors.
# Create synthetic data: 200 patients, 80 events, 3 numeric predictors
set.seed(42)
n <- 200
data_green <- data.frame(
  time = abs(rnorm(n, 20, 10)) + 0.1,
  status = factor(c(rep("event", 80), rep("censored", 120)),
                  levels = c("censored", "event")),
  var1 = rnorm(n),
  var2 = rnorm(n),
  var3 = rnorm(n)
)
lassocox(
  data = data_green,
  elapsedtime = "time",
  outcome = "status",
  outcomeLevel = "event",
  censorLevel = "censored",   # level of status that marks censored cases
  explanatory = c("var1", "var2", "var3"),
  suitabilityCheck = TRUE,    # produce the suitability report described below
  lambda = "lambda.1se",
  nfolds = 5,
  standardize = TRUE,
  cv_plot = TRUE,
  coef_plot = TRUE,
  survival_plot = TRUE
)
The suitability report should show all green indicators:
- EPV: 80/3 = 26.7 (excellent)
- Sample size: n=200 (adequate)
- Event rate: 40% (balanced)
- Multicollinearity: uncorrelated predictors
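The EPV and event-rate figures above are simple ratios and can be checked by hand. The helper below is a hypothetical sketch (the name `suitability_numbers` is not part of the package) and deliberately computes only the raw numbers, since the exact colour thresholds used by the module are not listed in this guide.

```r
# Events-per-variable (EPV) and event rate for a survival outcome.
# n_predictors should count model columns after dummy coding.
suitability_numbers <- function(status, event_level, n_predictors) {
  n_events <- sum(status == event_level, na.rm = TRUE)
  n <- sum(!is.na(status))
  list(
    n = n,
    events = n_events,
    epv = n_events / n_predictors,
    event_rate = n_events / n
  )
}

# Same layout as the synthetic "standard" scenario: 80 events, 120 censored
status <- factor(c(rep("event", 80), rep("censored", 120)),
                 levels = c("censored", "event"))

res <- suitability_numbers(status, event_level = "event", n_predictors = 3)
round(res$epv, 1)   # 26.7
res$event_rate      # 0.4
```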
2. Small Sample (Yellow Suitability)
When sample size is limited, the suitability assessment flags potential issues.
set.seed(42)
data_small <- data.frame(
  time = abs(rnorm(60, 15, 8)) + 0.1,
  status = factor(c(rep("event", 25), rep("censored", 35)),
                  levels = c("censored", "event")),
  var1 = rnorm(60), var2 = rnorm(60), var3 = rnorm(60),
  var4 = rnorm(60), var5 = rnorm(60)
)
lassocox(
  data = data_small,
  elapsedtime = "time",
  outcome = "status",
  outcomeLevel = "event",
  censorLevel = "censored",
  explanatory = c("var1", "var2", "var3", "var4", "var5"),
  suitabilityCheck = TRUE,    # produce the suitability report described below
  lambda = "lambda.1se",
  nfolds = 5,
  standardize = TRUE,
  cv_plot = TRUE,
  coef_plot = TRUE,
  survival_plot = TRUE
)
Expected suitability flags:
- EPV: 25/5 = 5.0 (yellow — adequate for LASSO but interpret with caution)
- Sample size: n=60 (yellow — small sample)
- Regularization: yellow — standard Cox may be an alternative
3. Multicollinearity Detection
When predictors are highly correlated, the suitability check identifies the pairs.
set.seed(42)
base_var <- rnorm(100)
data_collinear <- data.frame(
  time = abs(rnorm(100, 20, 10)) + 0.1,
  status = factor(c(rep("event", 45), rep("censored", 55)),
                  levels = c("censored", "event")),
  var1 = base_var,
  var2 = base_var + rnorm(100, 0, 0.1),  # r > 0.95 with var1
  var3 = rnorm(100),
  var4 = rnorm(100)
)
lassocox(
  data = data_collinear,
  elapsedtime = "time",
  outcome = "status",
  outcomeLevel = "event",
  censorLevel = "censored",
  explanatory = c("var1", "var2", "var3", "var4"),
  suitabilityCheck = TRUE,    # produce the suitability report described below
  lambda = "lambda.1se",
  nfolds = 5,
  standardize = TRUE,
  cv_plot = TRUE,
  coef_plot = TRUE,
  survival_plot = TRUE
)
The suitability report should flag:
- Multicollinearity: yellow/red with max |r| > 0.95
- Top correlated pair: var1 & var2
- Recommendation: Consider Elastic Net (Penalized Cox in ClinicoPath)
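The kind of check behind the "top correlated pair" flag can be approximated with base R. This is a minimal sketch assuming a plain Pearson correlation screen on numeric predictors; how the module computes its own flag is not shown in this guide, and the data frame `x` below is generated separately for the example.

```r
set.seed(42)
base_var <- rnorm(100)
x <- data.frame(
  var1 = base_var,
  var2 = base_var + rnorm(100, 0, 0.1),  # near-duplicate of var1
  var3 = rnorm(100),
  var4 = rnorm(100)
)

# Absolute pairwise Pearson correlations; keep the upper triangle only
r <- abs(cor(x))
r[lower.tri(r, diag = TRUE)] <- NA

# Locate the most strongly correlated pair
max_r <- max(r, na.rm = TRUE)
idx <- which(r == max_r, arr.ind = TRUE)[1, ]
top_pair <- c(rownames(r)[idx["row"]], colnames(r)[idx["col"]])

top_pair       # the pair built to be collinear: var1 and var2
max_r > 0.95   # TRUE for this construction
```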
4. Categorical Predictors
LASSO handles factor variables via dummy coding. The collinearity check correctly excludes within-factor dummy correlations.
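Dummy coding can be inspected directly with base R's model.matrix(). The sketch below assumes treatment (reference-level) coding, which is R's default for unordered factors; whether the module builds its design matrix exactly this way is an assumption. The data frame `d` is generated only for this illustration.

```r
set.seed(42)
d <- data.frame(
  grade     = factor(sample(c("Low", "Med", "High"), 100, replace = TRUE)),
  treatment = factor(sample(c("A", "B"), 100, replace = TRUE))
)

# Treatment coding; drop the intercept column, which a Cox model does not estimate
x <- model.matrix(~ grade + treatment, data = d)[, -1, drop = FALSE]

colnames(x)   # one column per non-reference level
ncol(x)       # 3 model columns from 2 factor variables
```

Each factor contributes one column per non-reference level, so the penalty can keep one level of a factor and drop another. The column count, not the variable count, is what matters for EPV.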
set.seed(42)
data_cat <- data.frame(
  time = abs(rnorm(100, 20, 10)) + 0.1,
  status = factor(c(rep("event", 40), rep("censored", 60)),
                  levels = c("censored", "event")),
  grade = factor(sample(c("Low", "Med", "High"), 100, replace = TRUE)),
  stage = factor(sample(c("I", "II", "III"), 100, replace = TRUE)),
  treatment = factor(sample(c("A", "B"), 100, replace = TRUE))
)
lassocox(
  data = data_cat,
  elapsedtime = "time",
  outcome = "status",
  outcomeLevel = "event",
  censorLevel = "censored",
  explanatory = c("grade", "stage", "treatment"),
  suitabilityCheck = TRUE,    # run the collinearity check described above
  lambda = "lambda.1se",
  nfolds = 5,
  standardize = TRUE,
  cv_plot = TRUE,
  coef_plot = TRUE,
  survival_plot = TRUE
)
5. Lambda Selection: lambda.min vs lambda.1se
The choice of lambda controls the trade-off between model fit and parsimony: a smaller lambda keeps more variables, a larger lambda keeps fewer.
# lambda.min: minimum CV error — more variables, better in-sample fit
lassocox(
  data = data_green,
  elapsedtime = "time",
  outcome = "status",
  outcomeLevel = "event",
  censorLevel = "censored",
  explanatory = c("var1", "var2", "var3"),
  suitabilityCheck = FALSE,
  lambda = "lambda.min",
  nfolds = 5,
  standardize = TRUE,
  cv_plot = TRUE,
  coef_plot = TRUE,
  survival_plot = TRUE
)
# lambda.1se: most regularized within 1 SE — fewer variables, better generalization
lassocox(
  data = data_green,
  elapsedtime = "time",
  outcome = "status",
  outcomeLevel = "event",
  censorLevel = "censored",
  explanatory = c("var1", "var2", "var3"),
  suitabilityCheck = FALSE,
  lambda = "lambda.1se",
  nfolds = 5,
  standardize = TRUE,
  cv_plot = TRUE,
  coef_plot = TRUE,
  survival_plot = TRUE
)
Interpretation:
- lambda.min: Selects more variables. Better in-sample performance but higher overfitting risk.
- lambda.1se (default): More parsimonious. Recommended for clinical models that need to generalize.
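The two lambda rules can also be inspected outside jamovi. The sketch below assumes the glmnet package as the fitting backend; the option names in this guide (lambda.min, lambda.1se, alpha, nfolds) match glmnet's, but the module's internals are not shown here, so treat this as an illustration rather than the module's implementation. The data are simulated, with only var1 affecting the hazard.

```r
library(glmnet)   # assumed backend for this sketch

set.seed(42)
n <- 200
x <- matrix(rnorm(n * 3), ncol = 3,
            dimnames = list(NULL, c("var1", "var2", "var3")))

# Simulated event and censoring times; only var1 affects the hazard
event_time <- rexp(n, rate = 0.05 * exp(0.8 * x[, "var1"]))
cens_time  <- rexp(n, rate = 0.03)
y <- cbind(time   = pmin(event_time, cens_time),
           status = as.numeric(event_time <= cens_time))  # 1 = event

# alpha = 1 is pure LASSO; nfolds sets the number of CV folds
cvfit <- cv.glmnet(x, y, family = "cox", alpha = 1, nfolds = 5,
                   standardize = TRUE)

cvfit$lambda.min   # lambda with the lowest cross-validated deviance
cvfit$lambda.1se   # largest lambda within 1 SE of that minimum

coef(cvfit, s = "lambda.min")   # typically more non-zero coefficients
coef(cvfit, s = "lambda.1se")   # typically fewer non-zero coefficients
```

By construction lambda.1se is never smaller than lambda.min, which is why it yields the same or a sparser model.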
6. Suitability Check Toggle
The suitability assessment is advisory — it never blocks the analysis.
# Disable suitability assessment
lassocox(
  data = data_green,
  elapsedtime = "time",
  outcome = "status",
  outcomeLevel = "event",
  censorLevel = "censored",
  explanatory = c("var1", "var2", "var3"),
  suitabilityCheck = FALSE,   # No suitability report shown
  lambda = "lambda.1se",
  nfolds = 5,
  standardize = TRUE,
  cv_plot = TRUE,
  coef_plot = TRUE,
  survival_plot = TRUE
)
7. Explanatory Output Options
Method explanations
lassocox(
  data = data_green,
  elapsedtime = "time",
  outcome = "status",
  outcomeLevel = "event",
  censorLevel = "censored",
  explanatory = c("var1", "var2", "var3"),
  suitabilityCheck = FALSE,
  lambda = "lambda.1se",
  nfolds = 5,
  standardize = TRUE,
  cv_plot = TRUE,
  coef_plot = TRUE,
  survival_plot = TRUE,
  showSummary = TRUE,             # Natural-language results paragraph for manuscripts
  showExplanations = TRUE,        # LASSO methodology + plot explanations
  showMethodologyNotes = TRUE,    # Technical details
  includeClinicalGuidance = TRUE  # Clinical interpretation guide
)
Variable importance and model comparison
lassocox(
  data = data_green,
  elapsedtime = "time",
  outcome = "status",
  outcomeLevel = "event",
  censorLevel = "censored",
  explanatory = c("var1", "var2", "var3"),
  suitabilityCheck = FALSE,
  lambda = "lambda.min",          # More likely to select variables
  nfolds = 5,
  standardize = TRUE,
  cv_plot = FALSE,
  coef_plot = FALSE,
  survival_plot = FALSE,
  showVariableImportance = TRUE,  # Importance scores + stability ranks
  showModelComparison = TRUE      # LASSO vs Standard Cox comparison
)
8. Risk Score Output
The computed risk score can be saved as a new variable in the dataset.
result <- lassocox(
  data = data_green,
  elapsedtime = "time",
  outcome = "status",
  outcomeLevel = "event",
  censorLevel = "censored",
  explanatory = c("var1", "var2", "var3"),
  suitabilityCheck = FALSE,
  lambda = "lambda.min",
  nfolds = 5,
  standardize = TRUE,
  cv_plot = FALSE,
  coef_plot = FALSE,
  survival_plot = FALSE
)
The risk score is computed as:
$$\text{Risk Score} = \sum_{j=1}^{p} \hat{\beta}_j X_j$$
where only LASSO-selected variables have non-zero coefficients.
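The risk score is the linear predictor: each coefficient multiplied by the patient's value for that predictor, summed over predictors. The sketch below computes it by hand with made-up coefficients. The median split used for stratification is an assumed rule chosen for illustration; the cut-point the module applies is not stated in this guide.

```r
set.seed(42)
x <- matrix(rnorm(200 * 3), ncol = 3,
            dimnames = list(NULL, c("var1", "var2", "var3")))

# Illustrative LASSO coefficients (made-up); var2 was dropped by the penalty
beta <- c(var1 = 0.60, var2 = 0, var3 = -0.25)

# Risk score = linear predictor; zero coefficients contribute nothing
risk_score <- drop(x %*% beta)

# Assumed stratification rule: split patients at the median risk score
risk_group <- factor(ifelse(risk_score > median(risk_score), "High", "Low"),
                     levels = c("Low", "High"))
table(risk_group)
```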
Interpreting Results
Model Summary Table
| Statistic | Meaning |
|---|---|
| Total Variables | Number of candidate predictors (after dummy coding) |
| Selected Variables | Variables with non-zero coefficients |
| Selection Proportion | Percentage of variables retained |
| Optimal Lambda | The regularization strength used |
| Sample Size | Complete cases used |
| Number of Events | Events observed |
| Censoring Rate | Percentage of censored observations |
Selected Variables Table
| Column | Meaning |
|---|---|
| Variable | Predictor name |
| Coefficient | Log hazard ratio (positive = risk, negative = protective) |
| Hazard Ratio | exp(coefficient) — multiplicative effect on hazard |
| 95% CI (Lower/Upper) | Confidence interval from post-selection unpenalized Cox refit |
| p | P-value from post-selection Cox refit (interpret with caution) |
| Importance | Absolute coefficient value |
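As a worked example of the Coefficient, Hazard Ratio, and Importance columns, the sketch below converts a few made-up coefficients. The variable names are borrowed from the breast cancer example in this guide, but the values are invented for illustration and are not results.

```r
# Illustrative log hazard ratios (made-up values, not fitted results)
coefs <- c(ki67_percent = 0.02, tumor_size_cm = 0.35, er_status = -0.40)

# Hazard ratio per one-unit increase in each predictor
hr <- exp(coefs)
round(hr, 2)

# Effect of a 10-unit increase in ki67_percent: exp(10 * coefficient)
hr_ki67_10 <- exp(10 * coefs[["ki67_percent"]])
round(hr_ki67_10, 2)

# Importance as defined in the table: absolute coefficient value
importance <- sort(abs(coefs), decreasing = TRUE)
names(importance)
```

Because importance is an absolute coefficient, it depends on each predictor's measurement scale, so a per-unit coefficient for a percentage looks small next to one for a binary marker.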
Common Pitfalls and Best Practices
Pitfalls to Avoid
- Using LASSO with too few events: EPV < 2 produces unreliable results. Even with regularization, LASSO needs sufficient events. This is a common problem in small radiomics cohorts (e.g., 80 patients with 300 features).
- Not standardizing variables: LASSO is sensitive to scale. A variable measured in 0–20,000 (e.g., tumor area) will dominate one measured in 0–100 (e.g., age) unless standardized. Use standardize = TRUE (the default).
- Interpreting coefficients as p-values: LASSO performs selection, not testing. Incorrect: “LASSO identified Ki67 as statistically significant.” Correct: “LASSO selected Ki67 as an important predictor.”
- Feature selection on full data (data leakage): A very common mistake. If you perform LASSO feature selection on all patients, then build and test the model on the same data, information leaks from the test set into the selection step. Feature selection must be done within the training set only.
- No external validation: Internal cross-validation produces optimistically biased estimates. Journals increasingly require external validation cohorts. The validation hierarchy:
  - Internal cross-validation (minimum)
  - Internal train/test split
  - External independent dataset (preferred)
- Ignoring feature stability: Radiomic or WSI features may change with scanner, staining protocol, segmentation method, or resolution. Before LASSO, features should be tested for robustness using intraclass correlation coefficient (ICC) or test–retest analysis.
- Reporting only selected variables: Good reporting requires selected features, lambda value used, cross-validation method, coefficient values, and performance metrics. Follow the TRIPOD statement for transparent reporting.
- Ignoring clinical variables: Some radiomics papers include only image features. Combined models (clinical + pathological + radiomic + molecular) usually perform better.
- Not assessing model performance properly: Survival models should report C-index and calibration curves. High discrimination (C-index) alone is insufficient; the model must also be well-calibrated. Consider decision curve analysis for clinical utility.
- No calibration analysis: A model with high AUC/C-index but poor calibration makes systematically wrong predictions. Always check calibration plots and consider the Hosmer–Lemeshow test or calibration slope.
Best Practices
- Check suitability first: Use the built-in assessment before interpreting results
- Prefer lambda.1se: More parsimonious and generalizable than lambda.min
- Report transparently: Lambda, CV method, all selected variables, coefficients
- Validate externally: Test in independent cohorts when possible
- Combine with clinical variables: Don’t use only image/genomic features
- Follow TRIPOD guidelines: For prognostic model reporting
- Test feature stability: Use ICC before entering features into LASSO
- Perform selection within training set: Never select features on the full dataset
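The "selection within the training set" rule can be sketched as follows. This assumes the glmnet and survival packages and simulated data; the 70/30 split, the feature names, and the effect sizes are arbitrary choices for illustration, not recommendations from this guide.

```r
library(glmnet)     # assumed LASSO backend for this sketch
library(survival)   # Surv() and concordance()

set.seed(42)
n <- 300; p <- 20
x <- matrix(rnorm(n * p), ncol = p,
            dimnames = list(NULL, paste0("feat", seq_len(p))))
event_time <- rexp(n, rate = 0.05 * exp(0.7 * x[, 1] - 0.5 * x[, 2]))
cens_time  <- rexp(n, rate = 0.03)
time   <- pmin(event_time, cens_time)
status <- as.numeric(event_time <= cens_time)   # 1 = event, 0 = censored

# Hold out 30% of patients BEFORE any feature selection happens
test_idx  <- sample(seq_len(n), size = round(0.3 * n))
train_idx <- setdiff(seq_len(n), test_idx)

# Selection and lambda tuning see the training rows only
y_train <- cbind(time = time[train_idx], status = status[train_idx])
cvfit <- cv.glmnet(x[train_idx, ], y_train, family = "cox",
                   alpha = 1, nfolds = 10)

# Apply the frozen model to the held-out rows
test_df <- data.frame(
  time   = time[test_idx],
  status = status[test_idx],
  lp     = drop(predict(cvfit, newx = x[test_idx, ], s = "lambda.1se"))
)

# C-index on the test set; reverse = TRUE because higher risk means shorter survival
c_test <- concordance(Surv(time, status) ~ lp, data = test_df,
                      reverse = TRUE)$concordance
c_test
```

The test rows never enter cv.glmnet(), so the held-out C-index is not inflated by the selection step. An external cohort would replace the random split in a stronger design.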
Radiomics and Digital Pathology Pipeline
LASSO Cox regression is central to many radiomics and digital pathology prognostic studies. A robust analysis pipeline follows these steps:
WSI / Imaging Data
↓
Feature Extraction (e.g., texture, morphology, deep features)
↓
Feature Stability Testing (ICC, test–retest)
↓
Feature Reduction (remove unstable/redundant features)
↓
LASSO Cox Selection (cross-validated, training set only)
↓
Model Construction (risk score from selected features)
↓
Internal Validation (bootstrap or cross-validation)
↓
External Validation (independent cohort)
↓
Calibration + Decision Curve Analysis
Application Areas
| Domain | Example Use |
|---|---|
| WSI radiomics | Texture and morphology features from whole-slide images |
| Deep feature selection | CNN-extracted features reduced via LASSO for interpretability |
| Tumor microenvironment signatures | Spatial features (immune cell density, stroma ratio) |
| Gene expression survival models | Selecting prognostic genes from expression panels |
Typical Wording in Papers
When reporting LASSO Cox results, consider these example phrasings:
“Feature selection was performed using LASSO penalized Cox regression with 10-fold cross-validation. The selected features were used to construct a radiomic risk score.”
“Variables were selected using LASSO regression with 10-fold cross-validation, and selected predictors were entered into the multivariable Cox model.”
9. Using Real Test Datasets
The package includes several realistic test datasets for LASSO Cox analysis.
Breast cancer clinical data
data(lassocox_breast_cancer)
lassocox(
  data = lassocox_breast_cancer,
  elapsedtime = "survival_months",
  outcome = "death",
  outcomeLevel = "Dead",
  censorLevel = "Alive",   # assumed label; confirm with levels(lassocox_breast_cancer$death)
  explanatory = c("age", "tumor_size_cm", "grade", "lymph_nodes_positive",
                  "ki67_percent", "er_status", "her2_status", "lvi"),
  suitabilityCheck = FALSE,
  lambda = "lambda.1se",
  nfolds = 10,
  standardize = TRUE,
  cv_plot = TRUE,
  coef_plot = TRUE,
  survival_plot = TRUE,
  showSummary = TRUE
)
The showSummary = TRUE option produces a natural-language results paragraph suitable for copying into manuscripts or reports.
10. Reproducibility with Random Seeds
The random_seed option ensures reproducible
cross-validation fold assignments. Using the same seed produces
identical results every time.
# Fixed seed for reproducible results
lassocox(
  data = data_green,
  elapsedtime = "time",
  outcome = "status",
  outcomeLevel = "event",
  censorLevel = "censored",
  explanatory = c("var1", "var2", "var3"),
  suitabilityCheck = FALSE,
  random_seed = 42,
  lambda = "lambda.1se",
  nfolds = 5,
  standardize = TRUE,
  cv_plot = FALSE,
  coef_plot = FALSE,
  survival_plot = FALSE
)
Different seeds may produce different variable selections, especially when predictors have similar importance. For publications, always report the seed used.
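Why the seed matters can be seen from the fold assignment alone. The sketch below uses a hypothetical helper, `make_folds`, to mimic random fold assignment in base R; it assumes the seed is applied before folds are drawn, which is the usual pattern but is not confirmed for the module's internals.

```r
# Assign each of n observations to one of k folds, reproducibly.
# Hypothetical helper for illustration; not part of the package.
make_folds <- function(n, k, seed) {
  set.seed(seed)
  sample(rep(seq_len(k), length.out = n))
}

folds_a <- make_folds(n = 200, k = 5, seed = 42)
folds_b <- make_folds(n = 200, k = 5, seed = 42)
folds_c <- make_folds(n = 200, k = 5, seed = 7)

identical(folds_a, folds_b)   # TRUE: same seed, same folds
identical(folds_a, folds_c)   # FALSE: different seed, different folds
table(folds_a)                # 40 observations per fold
```

Different folds give slightly different cross-validated error curves, which can shift the chosen lambda and therefore the selected variables.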
Related ClinicoPath Functions
| Function | Use When |
|---|---|
| Survival Analysis | Standard Kaplan-Meier with log-rank test (few predictors) |
| Multivariable Survival | Standard Cox with multiple predictors (no regularization needed) |
| Penalized Cox Regression | Elastic Net or Ridge (retain correlated predictors) |
| Adaptive LASSO | Two-stage LASSO with oracle properties |
| Group LASSO | Select/exclude grouped variables together |
References
- Tibshirani R. The lasso method for variable selection in the Cox model. Stat Med. 1997;16(4):385-395.
- Simon N, et al. Regularization paths for Cox’s proportional hazards model via coordinate descent. J Stat Softw. 2011;39(5):1-13.
- Collins GS, et al. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD). BMJ. 2015;350:g7594.