Advanced decision tree analysis using modern algorithms for clinical research and medical decision making. Extends basic decision tree functionality with gradient boosting, conditional inference trees, ensemble methods, and enhanced interpretability. Provides comprehensive performance evaluation, feature importance analysis, and clinical-focused visualizations. Complements existing ClinicoPath decision tree modules with state-of-the-art algorithms.

Usage

advancedtree(
  data,
  vars,
  facs,
  target,
  targetLevel,
  algorithm = "rpart",
  validation = "cv",
  cv_folds = 5,
  test_split = 0.3,
  max_depth = 6,
  min_samples_split = 20,
  min_samples_leaf = 10,
  n_estimators = 100,
  learning_rate = 0.1,
  feature_selection = FALSE,
  importance_method = "gini",
  handle_imbalance = FALSE,
  imbalance_method = "weights",
  hyperparameter_tuning = FALSE,
  tuning_method = "random",
  show_tree_plot = TRUE,
  show_importance_plot = TRUE,
  show_performance_metrics = TRUE,
  show_validation_curves = TRUE,
  show_roc_curve = TRUE,
  show_calibration_plot = FALSE,
  show_confusion_matrix = TRUE,
  interpretability = FALSE,
  shap_analysis = FALSE,
  partial_dependence = FALSE,
  interaction_analysis = FALSE,
  clinical_context = "diagnosis",
  cost_sensitive_thresholds = FALSE,
  fn_fp_ratio = 1,
  missing_data_handling = "complete",
  export_model = FALSE,
  bootstrap_confidence = FALSE,
  n_bootstrap = 1000
)

Arguments

data

The data as a data frame containing clinical variables, biomarkers, and patient outcomes for advanced decision tree analysis.

vars

Continuous variables such as biomarker levels, age, laboratory values, or quantitative measurements.

facs

Categorical variables such as tumor grade, stage, histological type, or patient demographics.

target

Primary outcome variable: disease status, treatment response, survival status, or diagnostic category.

targetLevel

Level representing the positive outcome, event of interest, or disease presence for binary classification.

algorithm

Tree algorithm to use. Each offers different strengths: CART (interpretable), ctree (unbiased), Random Forest (robust), XGBoost (high performance), Extra Trees (fast), Ensemble (combined).

validation

Validation approach for performance estimation. Cross-validation provides robust estimates, bootstrap suits small samples, holdout suits large datasets, and temporal validation suits time-ordered data.

cv_folds

Number of folds for cross-validation. 5-fold provides good balance between bias and variance for most clinical datasets.
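As a rough illustration of what k-fold cross-validation does with cv_folds, the base R sketch below assigns rows to folds at random and uses each fold once as the validation set. It is a minimal sketch, not the function's internal implementation; the sample size of 103 and the seed are arbitrary assumptions.

```r
# Plain (unstratified) k-fold assignment in base R.
k <- 5      # corresponds to cv_folds
n <- 103    # hypothetical number of patients
set.seed(42)

# Repeat fold labels 1..k to length n, then shuffle them.
folds <- sample(rep(seq_len(k), length.out = n))

valid_sizes <- integer(k)
for (i in seq_len(k)) {
  train_idx <- which(folds != i)  # rows used for fitting
  valid_idx <- which(folds == i)  # rows used for evaluation
  valid_sizes[i] <- length(valid_idx)
}
valid_sizes  # fold sizes differ by at most one
```

Every row appears in exactly one validation fold, so each observation contributes once to the performance estimate.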

test_split

Proportion of data reserved for testing (holdout validation). Typical values are 0.2-0.3 (20-30%).
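A holdout split can be sketched in base R as follows. The example stratifies by outcome so that both classes keep their proportions in the test set; whether advancedtree stratifies internally is not stated here, so treat that as an assumption. The outcome vector is simulated.

```r
set.seed(2024)
test_split <- 0.3

# Hypothetical outcome with 40 events and 160 non-events.
outcome <- factor(rep(c("event", "no_event"), times = c(40, 160)))

# Sample the test rows separately within each outcome class.
test_idx <- unlist(
  lapply(split(seq_along(outcome), outcome), function(idx) {
    sample(idx, size = round(test_split * length(idx)))
  }),
  use.names = FALSE
)
train_idx <- setdiff(seq_along(outcome), test_idx)

table(outcome[test_idx])  # class counts in the test set
```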

max_depth

Maximum depth of individual trees. Deeper trees capture more interactions but may overfit. Clinical trees typically have 2-8 levels.

min_samples_split

Minimum number of samples required to split a node. Higher values prevent overfitting in clinical data.

min_samples_leaf

Minimum number of samples in leaf nodes. Important for clinical validity: leaves with too few samples give unreliable predictions.

n_estimators

Number of trees in ensemble methods (Random Forest, XGBoost). More trees improve performance but increase computation time.

learning_rate

Learning rate for gradient boosting. Lower values require more trees but often achieve better performance.

feature_selection

Perform automated feature selection using tree-based importance. Helps identify most relevant clinical variables and biomarkers.

importance_method

Method for calculating feature importance. Permutation and SHAP provide more reliable importance for clinical interpretation.
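Permutation importance can be illustrated in a few lines of base R: shuffle one predictor, re-score the model, and record the drop in accuracy. The sketch below uses logistic regression (glm) as a stand-in for a tree model so that it runs without extra packages; the data, variable names, and the choice of 20 repetitions are assumptions. In practice the scoring would be done on held-out data rather than the training set.

```r
set.seed(1)
n <- 500
d <- data.frame(biomarker = rnorm(n), noise = rnorm(n))
# Simulated outcome that depends on biomarker only.
d$outcome <- rbinom(n, 1, plogis(3 * d$biomarker))

# Stand-in model; any model with a predict() method could be used.
fit <- glm(outcome ~ biomarker + noise, data = d, family = binomial)

accuracy <- function(model, data) {
  pred <- as.integer(predict(model, newdata = data, type = "response") >= 0.5)
  mean(pred == data$outcome)
}

baseline <- accuracy(fit, d)

# Importance = mean drop in accuracy after shuffling one column.
perm_importance <- sapply(c("biomarker", "noise"), function(v) {
  drops <- replicate(20, {
    shuffled <- d
    shuffled[[v]] <- sample(shuffled[[v]])
    baseline - accuracy(fit, shuffled)
  })
  mean(drops)
})
perm_importance
```

A predictor the model relies on shows a large drop, while an irrelevant one stays near zero.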

handle_imbalance

Address class imbalance common in clinical data (rare diseases, adverse events). Uses algorithm-specific approaches.

imbalance_method

Method for handling class imbalance. Class weights are simple and effective; SMOTE generates synthetic minority-class samples.
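One common way to derive class weights is inverse class frequency, sketched below in base R. This is an illustrative weighting scheme, not necessarily the one advancedtree applies; the outcome vector is simulated.

```r
# Hypothetical imbalanced outcome: 10 diseased, 90 healthy.
outcome <- factor(c(rep("disease", 10), rep("healthy", 90)))
class_counts <- table(outcome)

# Weight per class: n / (number of classes * class count).
class_weights <- setNames(
  as.numeric(length(outcome) / (length(class_counts) * class_counts)),
  names(class_counts)
)

# Expand to one weight per observation.
case_weights <- class_weights[as.character(outcome)]

class_weights       # the rare class receives the larger weight
sum(case_weights)   # total weight equals the sample size
```

Per-observation weights of this kind can be passed to fitting functions that accept a weights argument, such as rpart::rpart().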

hyperparameter_tuning

Perform automated hyperparameter tuning using grid search or random search. Improves model performance but increases runtime.

tuning_method

Hyperparameter optimization strategy. Random search is efficient for most problems; Bayesian optimization suits complex or expensive searches.

show_tree_plot

Display visual representation of the decision tree. Most informative for single trees, less useful for ensembles.

show_importance_plot

Display feature importance rankings. Critical for understanding which clinical variables drive predictions.

show_performance_metrics

Display comprehensive performance evaluation including accuracy, sensitivity, specificity, AUC, and clinical metrics.
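For reference, the core metrics can be derived from a 2x2 confusion matrix as in the base R sketch below. The eight actual and predicted labels are made up for illustration.

```r
lv <- c("pos", "neg")
actual    <- factor(c("pos", "pos", "pos", "neg", "neg", "neg", "neg", "pos"), levels = lv)
predicted <- factor(c("pos", "neg", "pos", "neg", "pos", "neg", "neg", "pos"), levels = lv)

# Rows are predictions, columns are the true classes.
cm <- table(Predicted = predicted, Actual = actual)

tp <- cm["pos", "pos"]  # true positives
fn <- cm["neg", "pos"]  # false negatives
fp <- cm["pos", "neg"]  # false positives
tn <- cm["neg", "neg"]  # true negatives

sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
accuracy    <- (tp + tn) / sum(cm)
ppv         <- tp / (tp + fp)  # positive predictive value
```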

show_validation_curves

Display learning curves and validation performance. Helps assess overfitting and training adequacy.

show_roc_curve

Display ROC curve analysis for binary classification. Essential for clinical decision making and threshold selection.

show_calibration_plot

Display probability calibration plot. Important for clinical applications requiring reliable probability estimates.

show_confusion_matrix

Display detailed confusion matrix with clinical interpretations. Shows actual vs predicted classifications.

interpretability

Perform advanced interpretability analysis including SHAP values, partial dependence plots, and interaction effects.

shap_analysis

Calculate SHAP (SHapley Additive exPlanations) values for individual prediction explanations. Powerful for clinical decision support.

partial_dependence

Show how individual features affect predictions across their value ranges. Helps understand clinical relationships.
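The idea behind partial dependence can be sketched in base R: fix one feature at a grid value for every patient, average the model's predictions, and repeat across the grid. The example uses logistic regression (glm) as a stand-in model and simulated data; the helper name partial_dependence is hypothetical.

```r
set.seed(7)
n <- 300
d <- data.frame(age = runif(n, 30, 80), biomarker = rnorm(n))
d$outcome <- rbinom(n, 1, plogis(-4 + 0.06 * d$age + d$biomarker))

# Stand-in model; a tree or ensemble with predict() would work the same way.
fit <- glm(outcome ~ age + biomarker, data = d, family = binomial)

# Average prediction when `feature` is set to each grid value for all rows.
partial_dependence <- function(model, data, feature, grid) {
  sapply(grid, function(value) {
    data[[feature]] <- value
    mean(predict(model, newdata = data, type = "response"))
  })
}

age_grid <- seq(30, 80, by = 10)
pd_age <- partial_dependence(fit, d, "age", age_grid)
data.frame(age = age_grid, mean_predicted_risk = pd_age)
```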

interaction_analysis

Analyze interactions between clinical variables. Important for understanding combined effects of biomarkers.

clinical_context

Clinical application context. Affects performance thresholds, interpretation guidelines, and visualization emphasis.

cost_sensitive_thresholds

Optimize decision thresholds considering clinical costs of false positives vs false negatives.

fn_fp_ratio

Relative cost of missing a positive case (false negative) vs raising a false alarm (false positive). Screening applications typically warrant a high ratio; confirmatory tests warrant a low ratio.
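One way to read fn_fp_ratio is as the weight on false negatives in a simple cost function, cost = fn_fp_ratio * FN + FP, minimized over candidate thresholds. The base R sketch below follows that reading; the cost function, the threshold grid, and the helper name pick_threshold are assumptions for illustration, and the probabilities are simulated.

```r
# Choose the probability threshold with the lowest weighted error cost.
pick_threshold <- function(prob, actual, fn_fp_ratio = 1,
                           grid = seq(0.05, 0.95, by = 0.05)) {
  cost <- vapply(grid, function(t) {
    pred <- prob >= t
    fn <- sum(!pred & actual)   # missed positives
    fp <- sum(pred & !actual)   # false alarms
    fn_fp_ratio * fn + fp
  }, numeric(1))
  grid[which.min(cost)]
}

set.seed(11)
actual <- rbinom(200, 1, 0.3) == 1
# Simulated predicted probabilities, higher on average for true positives.
prob <- plogis(rnorm(200, mean = ifelse(actual, 1, -1)))

t_default   <- pick_threshold(prob, actual, fn_fp_ratio = 1)
t_screening <- pick_threshold(prob, actual, fn_fp_ratio = 5)
c(default = t_default, screening = t_screening)
```

Raising the ratio never moves the chosen threshold upward under this cost function, which matches the intuition that screening should flag more patients.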

missing_data_handling

Strategy for handling missing data. Some tree-based algorithms handle missing values natively.

export_model

Export the trained model for external use or deployment. Useful for clinical decision support system integration.

bootstrap_confidence

Calculate bootstrap confidence intervals for performance metrics. Provides uncertainty quantification for clinical reporting.

n_bootstrap

Number of bootstrap samples for confidence interval calculation. More samples provide better estimates but increase computation time.
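A percentile bootstrap interval for accuracy can be sketched in base R as follows. This shows the general technique rather than the function's own procedure; the vector of per-patient correctness indicators is simulated.

```r
set.seed(123)
n_bootstrap <- 1000

# Hypothetical indicator: 1 if the model classified the patient correctly.
correct <- rbinom(150, 1, 0.8)

# Resample patients with replacement and recompute accuracy each time.
boot_acc <- replicate(n_bootstrap, mean(sample(correct, replace = TRUE)))

# 95% percentile interval.
ci <- quantile(boot_acc, c(0.025, 0.975))
c(accuracy = mean(correct), lower = ci[[1]], upper = ci[[2]])
```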

Value

A results object containing:

results$todo                         a html
results$model_summary                a html
results$tree_plot                    an image
results$importance_plot              an image
results$performance_table            a table
results$roc_plot                     an image
results$validation_curves            an image
results$confusion_matrix             a html
results$calibration_plot             an image
results$shap_plot                    an image
results$partial_dependence_plot      an image
results$interaction_plot             an image
results$hyperparameter_results       a html
results$clinical_interpretation      a html
results$feature_selection_results    a html
results$bootstrap_intervals          a html
results$model_export                 a html

Tables can be converted to data frames with asDF or as.data.frame. For example:

results$performance_table$asDF

as.data.frame(results$performance_table)

Examples

# Example: Advanced decision tree for cancer prognosis
data(cancer_data)
#> Warning: data set ‘cancer_data’ not found
advancedtree(
    data = cancer_data,
    vars = c("age", "tumor_size", "biomarker1", "biomarker2"),
    facs = c("grade", "stage", "histology"),
    target = "outcome",
    targetLevel = "poor_prognosis",
    algorithm = "xgboost",
    validation = "cv",
    interpretability = TRUE
)
#> Error: Argument 'vars' contains 'tumor_size', 'biomarker1', 'biomarker2' which are not present in the dataset
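The warning and error in the output above show that no cancer_data object with the expected columns was available when the example ran. A minimal sketch of a simulated data frame with the column names the example uses is given below. All values and factor levels are made up for illustration, apart from the names taken from the example call.

```r
set.seed(99)
n <- 120

# Simulated stand-in for cancer_data; values carry no clinical meaning.
cancer_data <- data.frame(
  age        = round(rnorm(n, mean = 62, sd = 10)),
  tumor_size = round(runif(n, 0.5, 8), 1),
  biomarker1 = rnorm(n),
  biomarker2 = rexp(n),
  grade      = factor(sample(c("low", "intermediate", "high"), n, replace = TRUE)),
  stage      = factor(sample(c("I", "II", "III", "IV"), n, replace = TRUE)),
  histology  = factor(sample(c("ductal", "lobular"), n, replace = TRUE)),
  outcome    = factor(sample(c("good_prognosis", "poor_prognosis"), n, replace = TRUE))
)

# Check that every column named in the example call is present.
required <- c("age", "tumor_size", "biomarker1", "biomarker2",
              "grade", "stage", "histology", "outcome")
missing_cols <- setdiff(required, names(cancer_data))
stopifnot(length(missing_cols) == 0)
```

With a data frame like this in place, the advancedtree() call from the example can be run without the missing-column error, although the results will not be meaningful because the outcome is random.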