Advanced decision tree analysis using modern algorithms for clinical research and medical decision making. Extends basic decision tree functionality with gradient boosting, conditional inference trees, ensemble methods, and enhanced interpretability. Provides comprehensive performance evaluation, feature importance analysis, and clinical-focused visualizations. Complements existing ClinicoPath decision tree modules with state-of-the-art algorithms.
Usage
advancedtree(
data,
vars,
facs,
target,
targetLevel,
algorithm = "rpart",
validation = "cv",
cv_folds = 5,
test_split = 0.3,
max_depth = 6,
min_samples_split = 20,
min_samples_leaf = 10,
n_estimators = 100,
learning_rate = 0.1,
feature_selection = FALSE,
importance_method = "gini",
handle_imbalance = FALSE,
imbalance_method = "weights",
hyperparameter_tuning = FALSE,
tuning_method = "random",
show_tree_plot = TRUE,
show_importance_plot = TRUE,
show_performance_metrics = TRUE,
show_validation_curves = TRUE,
show_roc_curve = TRUE,
show_calibration_plot = FALSE,
show_confusion_matrix = TRUE,
interpretability = FALSE,
shap_analysis = FALSE,
partial_dependence = FALSE,
interaction_analysis = FALSE,
clinical_context = "diagnosis",
cost_sensitive_thresholds = FALSE,
fn_fp_ratio = 1,
missing_data_handling = "complete",
export_model = FALSE,
bootstrap_confidence = FALSE,
n_bootstrap = 1000
)
Arguments
- data
The data as a data frame containing clinical variables, biomarkers, and patient outcomes for advanced decision tree analysis.
- vars
Continuous variables such as biomarker levels, age, laboratory values, or quantitative measurements.
- facs
Categorical variables such as tumor grade, stage, histological type, or patient demographics.
- target
Primary outcome variable: disease status, treatment response, survival status, or diagnostic category.
- targetLevel
Level representing the positive outcome, event of interest, or disease presence for binary classification.
- algorithm
Tree algorithm to use. Each offers different strengths: CART (interpretable), ctree (unbiased), Random Forest (robust), XGBoost (high performance), Extra Trees (fast), Ensemble (combined).
- validation
Validation approach for performance estimation. Cross-validation provides robust estimates, bootstrap suits small samples, holdout works well for large datasets, and temporal validation fits time-series data.
- cv_folds
Number of folds for cross-validation. 5-fold provides good balance between bias and variance for most clinical datasets.
- test_split
Proportion of data reserved for testing (holdout validation). Typical values: 20-30%.
- max_depth
Maximum depth of individual trees. Deeper trees capture more interactions but may overfit. Clinical trees typically 2-8 levels.
- min_samples_split
Minimum number of samples required to split a node. Higher values prevent overfitting in clinical data.
- min_samples_leaf
Minimum number of samples in leaf nodes. Important for clinical validity - too few samples reduce reliability.
- n_estimators
Number of trees in ensemble methods (Random Forest, XGBoost). More trees improve performance but increase computation time.
- learning_rate
Learning rate for gradient boosting. Lower values require more trees but often achieve better performance.
- feature_selection
Perform automated feature selection using tree-based importance. Helps identify most relevant clinical variables and biomarkers.
- importance_method
Method for calculating feature importance. Permutation and SHAP provide more reliable importance for clinical interpretation.
- handle_imbalance
Address class imbalance common in clinical data (rare diseases, adverse events). Uses algorithm-specific approaches.
- imbalance_method
Method for handling class imbalance. Class weights are simple and effective; SMOTE generates synthetic minority-class samples.
- hyperparameter_tuning
Perform automated hyperparameter tuning using grid search or random search. Improves model performance but increases runtime.
- tuning_method
Hyperparameter optimization strategy. Random search is efficient for most problems, Bayesian for complex scenarios.
- show_tree_plot
Display visual representation of the decision tree. Most informative for single trees, less useful for ensembles.
- show_importance_plot
Display feature importance rankings. Critical for understanding which clinical variables drive predictions.
- show_performance_metrics
Display comprehensive performance evaluation including accuracy, sensitivity, specificity, AUC, and clinical metrics.
- show_validation_curves
Display learning curves and validation performance. Helps assess overfitting and training adequacy.
- show_roc_curve
Display ROC curve analysis for binary classification. Essential for clinical decision making and threshold selection.
- show_calibration_plot
Display probability calibration plot. Important for clinical applications requiring reliable probability estimates.
- show_confusion_matrix
Display detailed confusion matrix with clinical interpretations. Shows actual vs predicted classifications.
- interpretability
Perform advanced interpretability analysis including SHAP values, partial dependence plots, and interaction effects.
- shap_analysis
Calculate SHAP (SHapley Additive exPlanations) values for individual prediction explanations. Powerful for clinical decision support.
- partial_dependence
Show how individual features affect predictions across their value ranges. Helps understand clinical relationships.
- interaction_analysis
Analyze interactions between clinical variables. Important for understanding combined effects of biomarkers.
- clinical_context
Clinical application context. Affects performance thresholds, interpretation guidelines, and visualization emphasis.
- cost_sensitive_thresholds
Optimize decision thresholds considering clinical costs of false positives vs false negatives.
- fn_fp_ratio
Relative cost of a false negative (missed positive case) versus a false positive (false alarm). Screening favors a high ratio; confirmatory testing favors a low ratio. See the sketch following this argument list.
- missing_data_handling
Strategy for handling missing data. Some of the available tree algorithms can handle missing values natively.
- export_model
Export the trained model for external use or deployment. Useful for clinical decision support system integration.
- bootstrap_confidence
Calculate bootstrap confidence intervals for performance metrics. Provides uncertainty quantification for clinical reporting.
- n_bootstrap
Number of bootstrap samples for confidence interval calculation. More samples provide better estimates but increase computation time.
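The imbalance-handling and cost-sensitivity arguments above are often combined for rare-event clinical outcomes. A minimal sketch, assuming a hypothetical data frame adverse_events with illustrative variable names; only arguments documented above are used:
# Sketch only: 'adverse_events' and its variables are hypothetical
# fn_fp_ratio = 5 weights a missed event 5x more than a false alarm
advancedtree(
  data = adverse_events,
  vars = c("age", "creatinine", "dose"),
  facs = c("sex", "comorbidity"),
  target = "event",
  targetLevel = "yes",
  algorithm = "rpart",
  validation = "cv",
  handle_imbalance = TRUE,
  imbalance_method = "weights",
  cost_sensitive_thresholds = TRUE,
  fn_fp_ratio = 5
)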
Value
A results object containing:
- results$todo: a html
- results$model_summary: a html
- results$tree_plot: an image
- results$importance_plot: an image
- results$performance_table: a table
- results$roc_plot: an image
- results$validation_curves: an image
- results$confusion_matrix: a html
- results$calibration_plot: an image
- results$shap_plot: an image
- results$partial_dependence_plot: an image
- results$interaction_plot: an image
- results$hyperparameter_results: a html
- results$clinical_interpretation: a html
- results$feature_selection_results: a html
- results$bootstrap_intervals: a html
- results$model_export: a html
Tables can be converted to data frames with asDF or as.data.frame. For example:
results$performance_table$asDF
as.data.frame(results$performance_table)
Examples
# Example: Advanced decision tree for cancer prognosis
# Simulated illustrative data; substitute your own clinical dataset
set.seed(42)
cancer_data <- data.frame(
  age        = rnorm(150, 65, 10),
  tumor_size = rnorm(150, 25, 8),
  biomarker1 = rnorm(150, 1.2, 0.4),
  biomarker2 = rnorm(150, 3.5, 1.1),
  grade      = factor(sample(c("G1", "G2", "G3"), 150, replace = TRUE)),
  stage      = factor(sample(c("I", "II", "III"), 150, replace = TRUE)),
  histology  = factor(sample(c("ductal", "lobular"), 150, replace = TRUE)),
  outcome    = factor(sample(c("good_prognosis", "poor_prognosis"), 150, replace = TRUE))
)

advancedtree(
  data = cancer_data,
  vars = c("age", "tumor_size", "biomarker1", "biomarker2"),
  facs = c("grade", "stage", "histology"),
  target = "outcome",
  targetLevel = "poor_prognosis",
  algorithm = "xgboost",
  validation = "cv",
  interpretability = TRUE
)
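A second sketch, reusing the simulated cancer_data above, showing cross-validated random-search tuning with bootstrap confidence intervals for performance reporting; all argument values are among those documented above:
# Sketch: automated tuning plus bootstrap confidence intervals for reporting
advancedtree(
  data = cancer_data,
  vars = c("age", "tumor_size", "biomarker1", "biomarker2"),
  facs = c("grade", "stage", "histology"),
  target = "outcome",
  targetLevel = "poor_prognosis",
  algorithm = "rpart",
  validation = "cv",
  cv_folds = 10,
  hyperparameter_tuning = TRUE,
  tuning_method = "random",
  bootstrap_confidence = TRUE,
  n_bootstrap = 500
)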