Analyzing Diagnostic Tests without a Gold Standard
ClinicoPath Development Team
2025-06-03
Source: vignettes/nogoldstandard.Rmd
Introduction
In clinical practice and research, evaluating diagnostic test
performance typically requires a “gold standard” reference test to
determine the true disease status. However, in many situations, a
perfect gold standard test is unavailable, expensive, invasive, or
unethical to perform. The nogoldstandard module in the ClinicoPath package provides methods for analyzing the performance of multiple diagnostic tests without requiring a gold standard reference.
This vignette demonstrates how to use the nogoldstandard module to:
- Estimate disease prevalence
- Calculate sensitivity and specificity for multiple tests
- Compute confidence intervals using bootstrap methods
- Visualize test agreement patterns
Why analyze tests without a gold standard?
When no perfect reference test exists, researchers typically face a choice among problematic options:
- Use an imperfect reference (introducing verification bias)
- Employ composite reference standards (potentially arbitrary)
- Exclude subjects without verification (introducing selection bias)
Statistical methods for “no gold standard” analysis offer an alternative approach, allowing estimation of test performance metrics by using the patterns of agreement and disagreement among multiple imperfect tests.
Methods Overview
The nogoldstandard module implements five different approaches:
1. Latent Class Analysis (LCA): Assumes disease status is an unobserved (latent) variable and models the relationship between this latent variable and the observed test results. In the standard single-population setting, at least three tests are needed for the model to be identifiable.
2. Composite Reference: Creates a reference standard by taking the majority result across all tests as the "true" status (see the sketch after this list).
3. All Tests Positive: Considers disease present only when all tests are positive (high-specificity approach).
4. Any Test Positive: Considers disease present when any test is positive (high-sensitivity approach).
5. Bayesian Analysis: Uses prior distributions and an EM algorithm to estimate parameters, potentially incorporating prior knowledge.
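As a minimal illustration of the three simple reference standards (items 2-4 above), the sketch below derives them from a logical matrix of test results. The matrix and object names here are invented for the example and are not the module's internals:
# Illustration only (not module internals): each column is one test,
# TRUE means a positive result
tests <- cbind(t1 = c(TRUE, TRUE, FALSE),
               t2 = c(TRUE, FALSE, FALSE),
               t3 = c(FALSE, TRUE, FALSE))
composite <- rowMeans(tests) > 0.5          # majority of tests positive
all_pos   <- rowSums(tests) == ncol(tests)  # every test positive
any_pos   <- rowSums(tests) > 0             # at least one test positive
cbind(composite, all_pos, any_pos)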
Installation
The nogoldstandard module is part of the ClinicoPath package:
# Install from CRAN
install.packages("ClinicoPath")
# Or the development version
# install.packages("devtools")
devtools::install_github("sbalci/ClinicoPath")
Basic Usage
Example Data
Let’s create a sample dataset with 4 diagnostic tests performed on 200 patients:
set.seed(123)
n <- 200 # Number of patients
# True disease status (unknown in practice)
true_status <- rbinom(n, 1, 0.3) # 30% prevalence
# Create 4 imperfect tests with different sensitivity/specificity
test1 <- ifelse(true_status == 1,
rbinom(n, 1, 0.90), # 90% sensitivity
rbinom(n, 1, 0.10)) # 90% specificity
test2 <- ifelse(true_status == 1,
rbinom(n, 1, 0.75), # 75% sensitivity
rbinom(n, 1, 0.05)) # 95% specificity
test3 <- ifelse(true_status == 1,
rbinom(n, 1, 0.85), # 85% sensitivity
rbinom(n, 1, 0.15)) # 85% specificity
test4 <- ifelse(true_status == 1,
rbinom(n, 1, 0.70), # 70% sensitivity
rbinom(n, 1, 0.05)) # 95% specificity
# Convert to categorical format
test1 <- factor(ifelse(test1 == 1, "pos", "neg"))
test2 <- factor(ifelse(test2 == 1, "pos", "neg"))
test3 <- factor(ifelse(test3 == 1, "pos", "neg"))
test4 <- factor(ifelse(test4 == 1, "pos", "neg"))
# Create data frame
data <- data.frame(
caseID = 1:n,
test1 = test1,
test2 = test2,
test3 = test3,
test4 = test4
)
# View the first few rows
head(data)
#>   caseID test1 test2 test3 test4
#> 1      1   pos   neg   pos   neg
#> 2      2   neg   pos   pos   pos
#> 3      3   pos   neg   neg   neg
#> 4      4   pos   pos   neg   pos
#> 5      5   pos   pos   pos   pos
#> 6      6   neg   pos   pos   neg
Running the Analysis
Now let’s analyze this data using the Latent Class Analysis method:
library(ClinicoPath)
result <- nogoldstandard(
data = data,
test1 = "test1",
test1Positive = "pos",
test2 = "test2",
test2Positive = "pos",
test3 = "test3",
test3Positive = "pos",
test4 = "test4",
test4Positive = "pos",
test5 = NULL,
method = "latent_class"
)
The result will contain:
- Estimated disease prevalence
- Sensitivity and specificity for each test
- A test agreement matrix visualization (an example of extracting these tables follows)
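The tables in the result object can be pulled into plain data frames with asDF(), the accessor used later in this vignette; the prevalence table name below is taken from that later example:
# Extract the prevalence table as a data frame
prev_df <- result$prevalence$asDF()
prev_df$estimate[1]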
Bootstrap Confidence Intervals
One of the key features of the nogoldstandard module is the ability to compute confidence intervals using bootstrap resampling. This provides a measure of uncertainty around the estimated parameters.
Understanding Bootstrap in NoGoldStandard Analysis
Bootstrap resampling involves the following steps (a generic sketch follows the list):
1. Randomly selecting observations from the original dataset with replacement
2. Running the selected analysis method on each resampled dataset
3. Computing the parameter of interest (prevalence, sensitivity, specificity)
4. Repeating steps 1-3 many times to build a distribution
5. Using the distribution to calculate confidence intervals
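Conceptually, the procedure looks like this generic percentile-bootstrap sketch; estimate_prevalence() is a hypothetical stand-in for whichever analysis method is selected, not a function exported by ClinicoPath:
# Generic percentile-bootstrap CI; estimate_prevalence() is a
# hypothetical placeholder for the selected analysis method
boot_ci <- function(data, estimate_prevalence, nboot = 1000, alpha = 0.05) {
  stats <- replicate(nboot, {
    resampled <- data[sample(nrow(data), replace = TRUE), ]
    estimate_prevalence(resampled)
  })
  quantile(stats, probs = c(alpha / 2, 1 - alpha / 2), na.rm = TRUE)
}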
Enabling Bootstrap with Progress Reporting
The enhanced version of the nogoldstandard module includes detailed progress reporting during bootstrap analysis:
result_with_ci <- nogoldstandard(
data = data,
test1 = "test1",
test1Positive = "pos",
test2 = "test2",
test2Positive = "pos",
test3 = "test3",
test3Positive = "pos",
test4 = "test4",
test4Positive = "pos",
method = "latent_class",
bootstrap = TRUE, # Enable bootstrap
nboot = 1000, # Number of bootstrap samples
alpha = 0.05 # For 95% confidence intervals
)
When running this analysis, you’ll see progress updates in the console:
=== Bootstrap Analysis ===
Starting bootstrap with 1000 iterations for latent_class method
Estimating confidence intervals for prevalence
50/1000 (5.0%) - 50 successful, 0 errors - 12.3 sec elapsed, ~234.7 sec remaining
100/1000 (10.0%) - 100 successful, 0 errors - 24.8 sec elapsed, ~223.2 sec remaining
...
950/1000 (95.0%) - 942 successful, 8 errors - 236.5 sec elapsed, ~12.5 sec remaining
1000/1000 (100.0%) - 991 successful, 9 errors - 249.3 sec elapsed, ~0.0 sec remaining
=== Bootstrap Complete ===
Total time: 249.3 seconds (4.01 iterations/sec)
Successful iterations: 991 (99.1%)
Failed iterations: 9 (0.9%)
Confidence interval (95.0%): [0.2145, 0.2987]
Expected Duration for Bootstrap Analysis
The time required for bootstrap analysis depends on several factors:
| Dataset Size | 100 Iterations | 1,000 Iterations | 10,000 Iterations |
|---|---|---|---|
| Small (<100 obs, 2-3 tests) | 5-30 sec | 30 sec - 5 min | 5-50 min |
| Medium (100-1,000 obs, 3-4 tests) | 30 sec - 2 min | 3-20 min | 30 min - 3 hrs |
| Large (>1,000 obs, 5 tests) | 1-5 min | 10-60 min | 1-8 hrs |
The analysis method also affects duration:
- Latent Class Analysis: slowest (multiple model-fitting attempts)
- Bayesian Analysis: moderate (iterative EM algorithm)
- Composite/All/Any Test: fastest (simple calculations)
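To budget wall-clock time for your own data, one option is to time a single un-bootstrapped run and multiply by the planned number of iterations; this ignores per-iteration variation, so treat it as a rough estimate:
# Time one plain (non-bootstrap) run of the chosen method
one_run <- system.time(
  nogoldstandard(
    data = data,
    test1 = "test1", test1Positive = "pos",
    test2 = "test2", test2Positive = "pos",
    test3 = "test3", test3Positive = "pos",
    test4 = "test4", test4Positive = "pos",
    method = "latent_class"
  )
)
# Rough projection for 1,000 bootstrap iterations, in seconds
unname(one_run["elapsed"]) * 1000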
Comparing Different Methods
Let’s compare the results from different analysis methods:
# Run analysis with each method
methods <- c("latent_class", "composite", "all_positive", "any_positive", "bayesian")
results <- list()
for (method in methods) {
results[[method]] <- nogoldstandard(
data = data,
test1 = "test1",
test1Positive = "pos",
test2 = "test2",
test2Positive = "pos",
test3 = "test3",
test3Positive = "pos",
test4 = "test4",
test4Positive = "pos",
method = method
)
}
# Extract prevalence estimates
prevalence_estimates <- sapply(results, function(x) {
x$prevalence$asDF()$estimate[1]
})
print(prevalence_estimates)
Different methods may produce different estimates. In general:
- Latent Class Analysis: Often considered the most theoretically sound but requires assumptions about conditional independence
- Composite Reference: Practical, but can be biased toward the tests that form the majority
- All Tests Positive: Conservative approach (low prevalence, high test specificity)
- Any Test Positive: Liberal approach (high prevalence, high test sensitivity)
- Bayesian Analysis: Flexible and can incorporate prior knowledge, but requires careful prior specification
Interpreting Results
Disease Prevalence
The estimated prevalence represents the proportion of the population expected to have the disease. Different methods will yield different prevalence estimates:
- All Tests Positive: Typically produces the lowest prevalence estimate
- Any Test Positive: Typically produces the highest prevalence estimate
- Latent Class Analysis: Usually produces an intermediate estimate
Test Performance
For each test, the module estimates:
- Sensitivity: The probability of a positive test result in patients with the disease
- Specificity: The probability of a negative test result in patients without the disease
Sensitivity and specificity can be used to:
- Compare test performance
- Inform test selection decisions
- Design optimal testing strategies (e.g., by computing predictive values, as sketched below)
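Once prevalence, sensitivity, and specificity have been estimated, the standard Bayes'-theorem identities give positive and negative predictive values. The helper below is an illustration, not part of the module:
# Illustrative helper (not part of the module): predictive values
# from prevalence, sensitivity, and specificity via Bayes' theorem
predictive_values <- function(prev, sens, spec) {
  ppv <- (sens * prev) / (sens * prev + (1 - spec) * (1 - prev))
  npv <- (spec * (1 - prev)) / (spec * (1 - prev) + (1 - sens) * prev)
  c(PPV = ppv, NPV = npv)
}
predictive_values(prev = 0.30, sens = 0.90, spec = 0.90)
#>       PPV       NPV 
#> 0.7941176 0.9545455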
Advanced Topics
Handling Missing Data
The nogoldstandard module can handle missing test results. The enhanced implementation:
- Skips missing values when calculating agreement (illustrated in the sketch below)
- Uses available data for each test when estimating parameters
- Properly accounts for missing data in bootstrap resampling
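For instance, pairwise agreement between two tests can be computed on only the observations where both results are available. This small helper is an illustration, not the module's internal code:
# Illustration only: pairwise-complete agreement between two tests;
# rows with NA in either test are dropped before comparing
agreement <- function(x, y) {
  ok <- !is.na(x) & !is.na(y)
  mean(x[ok] == y[ok])
}
agreement(data$test1, data$test2)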
Bayesian Analysis with Prior Information
If you have prior knowledge about disease prevalence or test characteristics, the Bayesian method can incorporate this information:
# The module does not currently expose prior-specification options;
# incorporating informative priors would require customizing the
# Bayesian method code.
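As a hedged illustration of the underlying idea (this is not the module's API): with a Beta(a, b) prior on prevalence, the EM prevalence update can be replaced by a maximum a posteriori (MAP) step that shrinks the estimate toward the prior mean a / (a + b):
# Illustration only (not the module's API): MAP update for prevalence
# under a Beta(a, b) prior, given each subject's posterior probability
# of disease from the E-step: (sum(p_post) + a - 1) / (n + a + b - 2)
prev_map <- function(p_post, a = 2, b = 8) {
  (sum(p_post) + a - 1) / (length(p_post) + a + b - 2)
}
# Example: 200 posterior probabilities centred near 0.3
set.seed(1)
p_post <- rbeta(200, 3, 7)
prev_map(p_post)  # pulled slightly toward the prior mean of 0.2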
Conditional Dependence
The default Latent Class Analysis assumes conditional independence between tests given the true disease status. If this assumption is violated, results may be biased. Extensions to handle conditional dependence include:
- Including direct test-to-test associations in the model
- Using more than two latent classes
- Applying hierarchical latent class models
Recommendations for Practice
Method Selection
- Start with Latent Class Analysis as the primary method
- Compare with composite methods to check robustness
- Consider Bayesian analysis if prior information is available
- Report results from multiple methods for transparency
Technical Implementation
The Enhanced Bootstrap Function
Below is the implementation of the bootstrap function with progress reporting:
.calculateBootstrapCI = function(data, method, nboot, alpha, type, test_index = NULL) {
# Simple bootstrap implementation with progress indicators
n <- nrow(data)
boot_results <- numeric(nboot)
# Show starting message
message("\n=== Bootstrap Analysis ===")
message(sprintf("Starting bootstrap with %d iterations for %s method", nboot, method))
message(sprintf("Estimating confidence intervals for %s", type))
if (!is.null(test_index)) {
message(sprintf("Test index: %d", test_index))
}
# Progress tracking variables
start_time <- Sys.time()
last_update <- start_time
update_interval <- max(1, floor(nboot / 20)) # Update ~20 times during process
success_count <- 0
error_count <- 0
for (b in 1:nboot) {
# Resample data
boot_indices <- sample(n, n, replace = TRUE)
boot_data <- data[boot_indices, ]
# Run analysis on bootstrap sample
boot_result <- NULL
tryCatch({
if (method == "latent_class") {
boot_result <- private$.runLCA(boot_data, names(data), NULL)
} else if (method == "composite") {
boot_result <- private$.runComposite(boot_data)
} else if (method == "all_positive") {
boot_result <- private$.runAllPositive(boot_data)
} else if (method == "any_positive") {
boot_result <- private$.runAnyPositive(boot_data)
} else if (method == "bayesian") {
boot_result <- private$.runBayesian(boot_data)
}
success_count <- success_count + 1
}, error = function(e) {
# Count errors but continue the bootstrap; <<- is needed so the
# counter in the enclosing function is updated, since a plain <-
# would only create a copy local to this handler
error_count <<- error_count + 1
})
# Extract relevant statistic
if (!is.null(boot_result)) {
if (type == "prevalence") {
boot_results[b] <- boot_result$prevalence
} else if (type == "sensitivity" && !is.null(test_index)) {
boot_results[b] <- boot_result$sensitivities[test_index]
} else if (type == "specificity" && !is.null(test_index)) {
boot_results[b] <- boot_result$specificities[test_index]
}
} else {
boot_results[b] <- NA
}
# Show progress updates
current_time <- Sys.time()
if (b %% update_interval == 0 || b == nboot ||
as.numeric(difftime(current_time, last_update, units = "secs")) > 10) {
elapsed <- as.numeric(difftime(current_time, start_time, units = "secs"))
percent_done <- b / nboot * 100
est_total <- elapsed / percent_done * 100
est_remaining <- est_total - elapsed
message(sprintf(" %d/%d (%.1f%%) - %d successful, %d errors - %.1f sec elapsed, ~%.1f sec remaining",
b, nboot, percent_done, success_count, error_count,
elapsed, est_remaining))
last_update <- current_time
}
}
# Show final statistics
total_time <- as.numeric(difftime(Sys.time(), start_time, units = "secs"))
message("\n=== Bootstrap Complete ===")
message(sprintf("Total time: %.1f seconds (%.2f iterations/sec)",
total_time, nboot/total_time))
message(sprintf("Successful iterations: %d (%.1f%%)",
success_count, success_count/nboot*100))
message(sprintf("Failed iterations: %d (%.1f%%)",
error_count, error_count/nboot*100))
# Calculate percentile CI
boot_results <- boot_results[!is.na(boot_results)]
if (length(boot_results) > 0) {
ci <- quantile(boot_results, c(alpha/2, 1-alpha/2), na.rm=TRUE)
message(sprintf("Confidence interval (%.1f%%): [%.4f, %.4f]",
(1-alpha)*100, ci[1], ci[2]))
return(list(lower = ci[1], upper = ci[2]))
} else {
message("WARNING: No valid bootstrap results obtained. Returning NA.")
return(list(lower = NA, upper = NA))
}
}
References
Hui SL, Walter SD. Estimating the error rates of diagnostic tests. Biometrics. 1980;36(1):167-171.
Joseph L, Gyorkos TW, Coupal L. Bayesian estimation of disease prevalence and the parameters of diagnostic tests in the absence of a gold standard. Am J Epidemiol. 1995;141(3):263-272.
Albert PS, Dodd LE. A cautionary note on the robustness of latent class models for estimating diagnostic error without a gold standard. Biometrics. 2004;60(2):427-435.
Collins J, Huynh M. Estimation of diagnostic test accuracy without full verification: a review of latent class methods. Stat Med. 2014;33(24):4141-4169.
Dendukuri N, Joseph L. Bayesian approaches to modeling the conditional dependence between multiple diagnostic tests. Biometrics. 2001;57(1):158-167.
Session Information
sessionInfo()
#> R version 4.3.2 (2023-10-31)
#> Platform: aarch64-apple-darwin20 (64-bit)
#> Running under: macOS 15.5
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
#>
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> time zone: Europe/Istanbul
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> loaded via a namespace (and not attached):
#> [1] digest_0.6.37 desc_1.4.3 R6_2.6.1 fastmap_1.2.0
#> [5] xfun_0.52 cachem_1.1.0 knitr_1.50 htmltools_0.5.8.1
#> [9] rmarkdown_2.29 lifecycle_1.0.4 cli_3.6.5 sass_0.4.10
#> [13] pkgdown_2.1.3 textshaping_1.0.1 jquerylib_0.1.4 systemfonts_1.2.3
#> [17] compiler_4.3.2 rstudioapi_0.17.1 tools_4.3.2 ragg_1.4.0
#> [21] bslib_0.9.0 evaluate_1.0.3 yaml_2.3.10 jsonlite_2.0.0
#> [25] rlang_1.1.6 fs_1.6.6 htmlwidgets_1.6.4