Multi-class Diagnostic Performance Evaluation
Comprehensive evaluation of multi-class classification models
meddecide
2025-10-09
Source: vignettes/meddecide-15-multiclass-diagnostics-comprehensive.Rmd
Overview
The Multi-class Diagnostic Performance Evaluation function assesses multi-class classification models using both per-class and overall metrics, addressing the need for thorough evaluation in machine learning and diagnostic research.
The function was developed following an analysis of thyroid cytology classification research, a setting in which multi-class diagnostic evaluation is essential for medical AI applications.
Key Features
Per-Class Metrics
- Sensitivity (Recall) for each class
- Specificity for each class
- Positive Predictive Value (PPV)
- Negative Predictive Value (NPV)
- F1 Score
- Youden Index
- Area Under the Curve (AUC) via one-vs-rest approach
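The per-class metrics listed above can all be derived from a confusion matrix by collapsing it into one 2×2 table per class. The following is a minimal base R sketch of that idea, not the function's own implementation; the class names and the toy labels are hypothetical and exist only for illustration.

```r
# Hypothetical toy data: three illustrative classes, ten cases.
lv        <- c("Benign", "Indeterminate", "Malignant")
actual    <- factor(rep(lv, times = c(4, 3, 3)), levels = lv)
predicted <- factor(lv[c(1, 1, 1, 2, 2, 2, 3, 3, 3, 1)], levels = lv)

# Rows are predictions, columns are true labels.
cm <- table(Predicted = predicted, Actual = actual)

# Treat each class in turn as "positive" and all others as "negative".
per_class <- t(sapply(lv, function(k) {
  tp <- cm[k, k]
  fp <- sum(cm[k, ]) - tp   # predicted k, truly another class
  fn <- sum(cm[, k]) - tp   # truly k, predicted another class
  tn <- sum(cm) - tp - fp - fn
  sens <- tp / (tp + fn)
  spec <- tn / (tn + fp)
  ppv  <- tp / (tp + fp)
  npv  <- tn / (tn + fn)
  c(Sensitivity = sens, Specificity = spec, PPV = ppv, NPV = npv,
    F1 = 2 * ppv * sens / (ppv + sens),
    Youden = sens + spec - 1)
}))

print(round(per_class, 3))
```

A class that is never predicted gives a 0/0 PPV (`NaN`) in this sketch; how the function itself handles that case is not stated here.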
Overall Performance
- Overall accuracy with confidence intervals
- Cohen’s Kappa for chance-corrected agreement
- Weighted averages of sensitivity, specificity, and F1 scores
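As a rough illustration of the overall metrics, the sketch below computes accuracy with a confidence interval, Cohen's kappa, and weighted averages in base R. Two choices here are assumptions rather than documented behavior: the interval is the exact (Clopper-Pearson) interval from `binom.test()`, and the weights are the class prevalences in the true labels. The toy data are hypothetical.

```r
# Hypothetical toy data: three illustrative classes, ten cases.
lv        <- c("Benign", "Indeterminate", "Malignant")
actual    <- factor(rep(lv, times = c(4, 3, 3)), levels = lv)
predicted <- factor(lv[c(1, 1, 1, 2, 2, 2, 3, 3, 3, 1)], levels = lv)

cm <- table(Predicted = predicted, Actual = actual)
n  <- sum(cm)

# Overall accuracy with an exact binomial confidence interval (assumed method).
correct  <- sum(diag(cm))
accuracy <- correct / n
acc_ci   <- binom.test(correct, n)$conf.int

# Cohen's kappa: observed agreement corrected for chance agreement.
p_e       <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa_hat <- (accuracy - p_e) / (1 - p_e)

# Per-class sensitivity and specificity, averaged with prevalence weights (assumed).
tp   <- diag(cm)
fp   <- rowSums(cm) - tp
fn   <- colSums(cm) - tp
tn   <- n - tp - fp - fn
sens <- tp / (tp + fn)
spec <- tn / (tn + fp)
w    <- colSums(cm) / n
weighted_sens <- sum(w * sens)
weighted_spec <- sum(w * spec)

print(c(accuracy = accuracy, kappa = kappa_hat,
        weighted_sens = weighted_sens, weighted_spec = weighted_spec))
print(acc_ci)
```

With prevalence weights, the weighted sensitivity equals the overall accuracy, which is a quick sanity check on the calculation.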
Example Applications
Statistical Methods
The function implements established diagnostic evaluation methods:
- One-vs-Rest ROC Analysis: Each class treated as positive against all others
- Confusion Matrix Analysis: Complete classification accuracy breakdown
- Diagnostic Metrics: Standard 2×2 table metrics extended to multi-class
- Statistical Comparisons: Paired tests for comparing two models evaluated on the same cases (the references cite McNemar's test and DeLong's method for correlated AUCs)
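The one-vs-rest idea can be sketched with a rank-based (Mann-Whitney) AUC computed once per class. This page does not say how the function obtains its AUC. The sketch assumes only hard class predictions are available, so the "score" for each class is a 0/1 indicator, and in that case the AUC reduces to the mean of sensitivity and specificity. The helper name `ovr_auc` and the toy data are hypothetical.

```r
# Hypothetical toy data: three illustrative classes, ten cases.
lv        <- c("Benign", "Indeterminate", "Malignant")
actual    <- factor(rep(lv, times = c(4, 3, 3)), levels = lv)
predicted <- factor(lv[c(1, 1, 1, 2, 2, 2, 3, 3, 3, 1)], levels = lv)

# Rank-based AUC: the probability that a random positive case scores higher
# than a random negative case, with ties counted as one half.
ovr_auc <- function(score, is_pos) {
  r     <- rank(score)   # midranks handle tied scores
  n_pos <- sum(is_pos)
  n_neg <- sum(!is_pos)
  (sum(r[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# One-vs-rest: class k is "positive", every other class is "negative".
auc <- sapply(lv, function(k) {
  ovr_auc(score = as.numeric(predicted == k), is_pos = actual == k)
})

print(round(auc, 3))
```

If per-class predicted probabilities were available, they could be passed as `score` in place of the 0/1 indicator without changing the helper.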
Data Requirements
- Predicted Classes: Categorical variable with model predictions
- Actual Classes: Categorical variable with true labels
- Optional Second Model: For model comparison analysis
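When a second model is supplied, both sets of predictions refer to the same cases, so the comparison is paired. The sketch below shows one way such data could be laid out and compared, using `mcnemar.test()` from base R on the correct/incorrect status of each model. The variable names and toy data are hypothetical, and the function's own comparison procedure may differ.

```r
# Hypothetical toy data: true labels plus predictions from two models.
lv     <- c("Benign", "Indeterminate", "Malignant")
actual <- factor(rep(lv, times = c(4, 3, 3)), levels = lv)
model1 <- factor(lv[c(1, 1, 1, 2, 2, 2, 3, 3, 3, 1)], levels = lv)
model2 <- factor(lv[c(1, 1, 1, 1, 2, 3, 2, 3, 3, 3)], levels = lv)

# Using the same factor levels for all three variables keeps tables aligned.
stopifnot(identical(levels(actual), levels(model1)),
          identical(levels(actual), levels(model2)))

# Cross-tabulate whether each model classified each case correctly.
correct1 <- factor(model1 == actual, levels = c(TRUE, FALSE))
correct2 <- factor(model2 == actual, levels = c(TRUE, FALSE))
paired   <- table(Model1 = correct1, Model2 = correct2)

# McNemar's test uses only the discordant cells (one model right, the other wrong).
result <- mcnemar.test(paired)

print(paired)
print(result)
```

With very few discordant pairs, as in this toy example, the chi-squared approximation is rough and an exact binomial test on the discordant counts may be preferable.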
Interpretation Guidelines
AUC Values
- 0.9-1.0: Excellent discrimination
- 0.8-0.9: Good discrimination
- 0.7-0.8: Fair discrimination
- 0.6-0.7: Poor discrimination
- 0.5-0.6: Fail (little to no discriminative ability; 0.5 corresponds to chance)
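These bands can be applied programmatically with `cut()`. Because the listed ranges share their endpoints, the sketch below assumes each boundary value belongs to the higher band (so 0.9 is labeled "Excellent"). The helper name `interpret_auc` is hypothetical.

```r
# Map AUC values to the interpretation bands listed above.
# Assumption: intervals are closed on the left, so a boundary value such as
# 0.8 falls in the higher band ("Good"). Values below 0.5 return NA.
interpret_auc <- function(auc) {
  cut(auc,
      breaks = c(0.5, 0.6, 0.7, 0.8, 0.9, 1.0),
      labels = c("Fail", "Poor", "Fair", "Good", "Excellent"),
      right = FALSE,
      include.lowest = TRUE)  # with right = FALSE this keeps 1.0 in the top band
}

labels <- as.character(interpret_auc(c(0.55, 0.79, 0.9, 1.0)))
print(labels)
```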
References
This implementation follows guidelines from:
- DeLong, E.R., DeLong, D.M., Clarke-Pearson, D.L. (1988). Comparing the areas under two or more correlated receiver operating characteristic curves.
- McNemar, Q. (1947). Note on the sampling error of the difference between correlated proportions.
- Youden, W.J. (1950). Index for rating diagnostic tests.
- Cohen, J. (1960). A coefficient of agreement for nominal scales.