A dataset of 150 patients used to evaluate three raters (experienced, moderately experienced, and junior) against a consensus-panel gold standard.
Format
A data frame with 150 rows and 6 variables:
- patient_id
Character: Patient identifier (PT001-PT150)
- ConsensusPanel
Factor: Consensus diagnosis ("No Disease", "Disease"), 32% disease
- Rater1
Factor: Experienced rater ("No Disease", "Disease"), Sens=0.88, Spec=0.90
- Rater2
Factor: Moderately experienced rater ("No Disease", "Disease"), Sens=0.82, Spec=0.85
- Rater3
Factor: Junior rater ("No Disease", "Disease"), Sens=0.75, Spec=0.82
- case_difficulty
Factor: Case difficulty (Easy, Moderate, Difficult)
Details
An inter-rater reliability study with raters of varying expertise, demonstrating how diagnostic performance can be compared across experience levels.
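For illustration, a dataset with this structure can be simulated in a few lines of base R. The sketch below is a hypothetical reconstruction that assumes the documented prevalence and per-rater sensitivities/specificities; the helper `simulate_rater`, the seed, and the difficulty distribution are illustrative assumptions, not the code used to generate `decisioncompare_raters`.

```r
# Hypothetical simulation of a dataset shaped like decisioncompare_raters.
set.seed(42)
n <- 150
truth <- rbinom(n, 1, 0.32)  # latent disease status, ~32% prevalence

# Draw one rater's calls given an assumed sensitivity and specificity
simulate_rater <- function(truth, sens, spec) {
  rating <- ifelse(truth == 1,
                   rbinom(length(truth), 1, sens),      # diseased: positive with prob sens
                   rbinom(length(truth), 1, 1 - spec))  # healthy: positive with prob 1 - spec
  factor(rating, levels = c(0, 1), labels = c("No Disease", "Disease"))
}

sim <- data.frame(
  patient_id      = sprintf("PT%03d", seq_len(n)),
  ConsensusPanel  = factor(truth, levels = c(0, 1),
                           labels = c("No Disease", "Disease")),
  Rater1          = simulate_rater(truth, sens = 0.88, spec = 0.90),
  Rater2          = simulate_rater(truth, sens = 0.82, spec = 0.85),
  Rater3          = simulate_rater(truth, sens = 0.75, spec = 0.82),
  case_difficulty = factor(sample(c("Easy", "Moderate", "Difficult"),
                                  n, replace = TRUE),
                           levels = c("Easy", "Moderate", "Difficult"))
)
str(sim)
```

Cross-tabulating a rater against the panel, e.g. `table(sim$Rater1, sim$ConsensusPanel)`, yields the kind of 2x2 counts shown in the example output.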
Examples
data(decisioncompare_raters)
decisioncompare(data = decisioncompare_raters, gold = "ConsensusPanel",
                goldPositive = "Disease", test1 = "Rater1",
                test1Positive = "Disease", test2 = "Rater2",
                test2Positive = "Disease", test3 = "Rater3",
                test3Positive = "Disease", statComp = TRUE)
#> Performing 3 pairwise comparisons with Holm-Bonferroni correction...
#> Statistical comparisons completed (3 comparisons, Holm-Bonferroni corrected).
#>
#> COMPARE MEDICAL DECISION TESTS
#>
#> character(0)
#>
#> Test 1 - Recoded Data
#> ────────────────────────────────────────────────────────────────
#> Gold Positive Gold Negative Total
#> ────────────────────────────────────────────────────────────────
#> Test Positive 35.000000 9.000000 44.00000
#> Test Negative 4.000000 102.000000 106.00000
#> Total 39.000000 111.000000 150.00000
#> ────────────────────────────────────────────────────────────────
#>
#>
#> Test 2 - Recoded Data
#> ────────────────────────────────────────────────────────────────
#> Gold Positive Gold Negative Total
#> ────────────────────────────────────────────────────────────────
#> Test Positive 29.00000 11.00000 40.00000
#> Test Negative 10.00000 100.00000 110.00000
#> Total 39.00000 111.00000 150.00000
#> ────────────────────────────────────────────────────────────────
#>
#>
#> Test 3 - Recoded Data
#> ────────────────────────────────────────────────────────────────
#> Gold Positive Gold Negative Total
#> ────────────────────────────────────────────────────────────────
#> Test Positive 29.00000 12.00000 41.00000
#> Test Negative 10.00000 99.00000 109.00000
#> Total 39.00000 111.00000 150.00000
#> ────────────────────────────────────────────────────────────────
#>
#>
#> Decision Test Comparison
#> ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
#> Test Sensitivity Specificity Accuracy Positive Predictive Value Negative Predictive Value Positive Likelihood Ratio Negative Likelihood Ratio
#> ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
#> Rater1 89.74359 91.89189 91.33333 79.54545 96.22642 11.068376 0.1116139
#> → Good balanced performance; Strong positive evidence; Moderate negative evidence
#> Rater2 74.35897 90.09009 86.00000 72.50000 90.90909 7.503497 0.2846154
#> → Good specificity for confirmation; Moderate positive evidence
#> Rater3 74.35897 89.18919 85.33333 70.73171 90.82569 6.878205 0.2874903
#> → Good specificity for confirmation; Moderate positive evidence
#> ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
#>
#>
#> Statistical Comparison of Test Accuracy
#> ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
#> Comparison Chi-squared df p-value Clinical Interpretation
#> ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
#> Overall (3 tests) 2.979592 2 0.2254187 No significant overall difference among tests (p≥0.05) ᵃ
#> Rater1 vs Rater2 1.531250 1 0.4318499 No significant difference (p≥0.1) (Holm-Bonferroni corrected)
#> Rater1 vs Rater3 2.206897 1 0.4121845 No significant difference (p≥0.1) (Holm-Bonferroni corrected)
#> Rater2 vs Rater3 0.000000 1 1.0000000 No significant difference (p≥0.1) (Holm-Bonferroni corrected)
#> ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
#> Note. For 2 tests: McNemar's test compares diagnostic CORRECTNESS (agreement with gold standard) between paired
#> tests. For 3+ tests: Cochran's Q test provides an overall test, followed by pairwise McNemar's tests with
#> Holm-Bonferroni correction for multiple comparisons. Tests examine discordant pairs (cases where one test is correct
#> and the other is wrong relative to the gold standard) to determine if differences in accuracy are statistically
#> significant.
#> ᵃ Cochran's Q test shows no significant difference. Pairwise comparisons below may not be meaningful.
#>
#>
#> Differences with 95% Confidence Intervals
#> ────────────────────────────────────────────────────────────────────────────
#> Comparison Metric Difference Lower Upper
#> ────────────────────────────────────────────────────────────────────────────
#> Rater1 vs Rater2 Sensitivity 15.38462 ᵃ -1.34142 32.11065
#> Rater1 vs Rater2 Specificity 1.80180 ᵇ -6.08768 9.69128
#> Rater1 vs Rater2 Accuracy 5.33333 ᵈ -2.00871 12.67538
#> Rater1 vs Rater3 Sensitivity 15.38462 ᵉ 0.24368 30.52555
#> Rater1 vs Rater3 Specificity 2.70270 ᶠ -4.97751 10.38292
#> Rater1 vs Rater3 Accuracy 6.00000 ᵍ -0.97067 12.97067
#> Rater2 vs Rater3 Sensitivity 0.00000 ʰ -20.10219 20.10219
#> Rater2 vs Rater3 Specificity 0.90090 ⁱ -7.18897 8.99077
#> Rater2 vs Rater3 Accuracy 0.66667 ʲ -7.28061 8.61395
#> ────────────────────────────────────────────────────────────────────────────
#> ᵃ Small paired sample/discordant counts; CI may be unstable (n=39,
#> discordant counts: 26, 9, 3, 1).
#> ᵇ Small paired sample/discordant counts; CI may be unstable (n=111,
#> discordant counts: 91, 11, 9, 0).
#> ᵈ Small paired sample/discordant counts; CI may be unstable (n=150,
#> discordant counts: 117, 20, 12, 1).
#> ᵉ Small paired sample/discordant counts; CI may be unstable (n=39,
#> discordant counts: 27, 8, 2, 2).
#> ᶠ Small paired sample/discordant counts; CI may be unstable (n=111,
#> discordant counts: 91, 11, 8, 1).
#> ᵍ Small paired sample/discordant counts; CI may be unstable (n=150,
#> discordant counts: 118, 19, 10, 3).
#> ʰ Small paired sample/discordant counts; CI may be unstable (n=39,
#> discordant counts: 21, 8, 8, 2).
#> ⁱ Small paired sample/discordant counts; CI may be unstable (n=111,
#> discordant counts: 89, 11, 10, 1).
#> ʲ Small paired sample/discordant counts; CI may be unstable (n=150,
#> discordant counts: 110, 19, 18, 3).
#>
#>
#> 📋 Clinical Summary
#>
#> Among the tests evaluated, Rater1 demonstrated the best diagnostic
#> performance, with 89.7% sensitivity, 91.9% specificity, 79.5%
#> positive predictive value, 96.2% negative predictive value, and 91.3%
#> overall accuracy (95% CIs in the confidence interval table).
#> Statistical comparisons using McNemar's test revealed no significant
#> differences in test performance (detailed results in the comparison
#> tables). The likelihood ratio for positive results was 11.07 and for
#> negative results was 0.11.
#>
#> 📝 Report Sentences
#>
#> Methods Section:
#>
#> We compared the diagnostic performance of 3 tests (Rater1, Rater2,
#> Rater3) against the gold standard reference using diagnostic accuracy
#> analysis. The study included 150 cases with complete data.
#> Performance metrics calculated included sensitivity, specificity,
#> positive and negative predictive values, likelihood ratios, and
#> overall accuracy. Statistical comparisons between tests were
#> performed using McNemar's test comparing diagnostic correctness
#> (agreement with the gold standard).
#>
#> Results Section:
#>
#> Among the tests evaluated, Rater1 demonstrated the best diagnostic
#> performance, with 89.7% sensitivity, 91.9% specificity, 79.5%
#> positive predictive value, 96.2% negative predictive value, and 91.3%
#> overall accuracy (95% CIs in the confidence interval table).
#> Statistical comparisons using McNemar's test revealed no significant
#> differences in test performance (detailed results in the comparison
#> tables). The likelihood ratio for positive results was 11.07 and for
#> negative results was 0.11.
#>
#> 💡 Clinical Recommendations
#>
#> Clinical Consideration: Consider using Rater1 in combination with
#> other tests for optimal diagnostic accuracy.
#>
#> Implementation Note: Results should be interpreted in the context of
#> disease prevalence in your clinical population. Consider local
#> validation studies before implementation.
#>
#> 🔬 About Medical Decision Test Comparison
#>
#> 📊 What This Analysis Does
#>
#> This tool compares the diagnostic performance of multiple medical
#> tests against a gold standard reference. It systematically evaluates
#> sensitivity, specificity, predictive values, likelihood ratios, and
#> overall accuracy to help you determine which test performs best for
#> your clinical scenario.
#>
#> 🎯 When to Use This Analysis
#>
#> - Test Validation: Comparing new diagnostic methods against
#>   established standards
#> - Method Comparison: Evaluating which of several tests performs better
#> - Clinical Research: Validating biomarkers, imaging techniques, or
#>   clinical assessments
#> - Quality Assessment: Measuring agreement between different raters or
#>   methods
#> - Protocol Development: Optimizing diagnostic workflows
#>
#> 📝 How to Use This Analysis
#>
#> 1. Select Gold Standard: Choose your most reliable reference test
#>    (e.g., biopsy, expert consensus)
#> 2. Choose Tests to Compare: Select 2-3 diagnostic tests you want to
#>    evaluate
#> 3. Define Positive Levels: Specify what constitutes a "positive"
#>    result for each test
#> 4. Configure Options: Enable statistical comparisons, confidence
#>    intervals, or visualizations as needed
#> 5. Run Analysis: Review results tables and clinical interpretations
#> 6. Copy Report: Use the auto-generated sentences for your
#>    documentation
#>
#> 📈 Key Metrics Explained
#>
#>
#> Sensitivity: Probability test is positive when disease present
#> (rule-out ability)
#>
#> Specificity: Probability test is negative when disease absent (rule-in
#> ability)
#>
#> PPV: Probability of disease when test positive
#>
#> NPV: Probability of no disease when test negative
#>
#> LR+: How much positive test increases odds of disease
#>
#> LR-: How much negative test decreases odds of disease
#>
#> Accuracy: Overall probability of correct classification
#>
#> McNemar Test: Statistical comparison between paired tests
#>
#> ⚕️ Clinical Interpretation Guidelines
#>
#> Screening Tests (Rule-Out):
#> • Sensitivity ≥95%: Excellent
#> • NPV ≥95%: High confidence
#> • Goal: Minimize false negatives
#>
#> Confirmatory Tests (Rule-In):
#> • Specificity ≥95%: Excellent
#> • PPV ≥90%: High confidence
#> • Goal: Minimize false positives
#>
#> ⚠️ Important Assumptions & Limitations
#>
#> - Gold Standard: Assumes your reference test is truly accurate
#> - Sample Size: Results are more reliable with larger, representative
#>   samples
#> - Prevalence Dependency: PPV and NPV vary with disease prevalence
#> - McNemar Test: Requires paired/matched data for statistical
#>   comparisons
#> - Missing Data: Cases with incomplete data are excluded from analysis
#> - Confidence Intervals: Calculated using the Wilson method for better
#>   accuracy
#>
#> ℹ️ Analysis Completed Successfully
#>
#> 3 diagnostic tests compared using 150 complete cases. The gold
#> standard identified 39 diseased and 111 healthy cases. Review the
#> comparison tables and statistical tests above.
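The headline metrics and test statistics above can be recomputed by hand from the printed counts. The base R sketch below rederives Rater1's operating characteristics from its recoded 2x2 table (TP=35, FP=9, FN=4, TN=102) and the continuity-corrected McNemar statistic for Rater1 vs Rater2 from the discordant counts in footnote d; it illustrates the standard formulas rather than the package's internal implementation.

```r
# Rater1 vs consensus panel, counts from "Test 1 - Recoded Data" above
TP <- 35; FP <- 9; FN <- 4; TN <- 102

sens   <- TP / (TP + FN)                    # 35/39   -> 89.74%
spec   <- TN / (TN + FP)                    # 102/111 -> 91.89%
acc    <- (TP + TN) / (TP + FP + FN + TN)   # 137/150 -> 91.33%
ppv    <- TP / (TP + FP)                    # 35/44   -> 79.55%
npv    <- TN / (TN + FN)                    # 102/106 -> 96.23%
lr_pos <- sens / (1 - spec)                 # ~11.07: a positive call raises disease odds ~11-fold
lr_neg <- (1 - sens) / spec                 # ~0.11: a negative call lowers disease odds ~9-fold

# McNemar's test on diagnostic correctness, Rater1 vs Rater2.
# Footnote d reports the paired outcome counts: 117 both correct,
# 20 only Rater1 correct, 12 only Rater2 correct, 1 both wrong.
n10 <- 20  # only Rater1 agrees with the gold standard
n01 <- 12  # only Rater2 agrees with the gold standard
chisq <- (abs(n10 - n01) - 1)^2 / (n10 + n01)       # continuity-corrected: 49/32 = 1.53125
p_raw <- pchisq(chisq, df = 1, lower.tail = FALSE)  # ~0.216 before correction
# Holm-Bonferroni over the 3 pairwise comparisons multiplies this
# (second-smallest) raw p by 2, giving ~0.432 as in the table above.
```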