Data Quality Assessment - Enhanced Visual Analysis with visdat Integration
Comprehensive guide to the ClinicoPath dataquality function for clinical research data validation
ClinicoPath
2025-07-13
Source:vignettes/clinicopath-descriptives-05-dataquality-comprehensive.Rmd
clinicopath-descriptives-05-dataquality-comprehensive.Rmd
Introduction to Enhanced Data Quality Assessment
The dataquality
function in ClinicoPath provides
sophisticated data quality assessment capabilities specifically designed
for clinical research. Unlike general data quality tools, this function
integrates advanced visual analysis using the visdat
package, offering researchers a comprehensive suite of tools for data
validation, pattern detection, and quality assurance.
Why Enhanced Data Quality Assessment?
Clinical Research Requirements
- Regulatory Compliance: Meet ICH E6 and FDA data quality standards
- Patient Safety: Ensure treatment decisions are based on reliable data
- Research Integrity: Validate data before statistical analysis
- Publication Quality: Enhance credibility of research findings
Advanced Features
- Visual Data Exploration: Integration with visdat for comprehensive visual analysis
- Multiple Analysis Types: Missing patterns, duplicates, completeness analysis
- Flexible Configuration: Customizable thresholds and analysis options
- Export Capabilities: Professional plots ready for publication
What Makes dataquality Different?
Unique Capabilities
- visdat Integration: First jamovi module to integrate visdat visual analysis
- Clinical Context: Designed specifically for medical research workflows
- Comprehensive Analysis: Combined statistical and visual assessment
- Flexible Outputs: HTML reports with interactive insights
Analysis Types Available
- Data Overview (vis_dat): Variable types and missing value patterns
- Missing Patterns (vis_miss): Missing value clustering and relationships
- Data Types (vis_guess): Automatic type detection and validation
- Value Expectations (vis_expect): Expected vs actual data patterns
- Comprehensive Analysis: All visual analyses combined
Getting Started
Load Required Libraries and Data
library(ClinicoPath)
library(dplyr)
library(knitr)
# Load clinical research datasets
data("histopathology")
data("treatmentResponse")
# Display overview of available datasets
cat("📊 Clinical Research Datasets Available:\n")
## 📊 Clinical Research Datasets Available:
## - histopathology: Pathological research data ( 250 patients)
if(exists("treatmentResponse")) {
cat(" - treatmentResponse: Treatment outcome data (", nrow(treatmentResponse), " patients)\n")
}
## - treatmentResponse: Treatment outcome data ( 250 patients)
Basic Data Quality Workflow
The enhanced data quality assessment workflow follows these systematic steps:
- Variable Selection: Choose variables for quality assessment
- Analysis Configuration: Select missing data, duplicates, visual analysis options
- Visual Analysis Type: Choose specific visdat analysis type
- Threshold Setting: Configure missing data highlight thresholds
- Export Options: Enable plot export for publications
- Execution: Run comprehensive quality assessment
- Interpretation: Review results and recommendations
Core Examples and Applications
Example 1: Basic Data Quality Assessment
Starting with a simple quality assessment of core clinical variables.
# Basic data quality assessment
basic_quality <- dataquality(
data = histopathology,
vars = c("Age", "Sex", "Grade"),
check_duplicates = TRUE,
check_missing = TRUE,
complete_cases_only = FALSE,
visual_analysis = FALSE, # Start without visual analysis
visdat_type = "vis_dat",
missing_threshold_visual = 10,
export_plots = FALSE
)
# View basic quality results
print("📋 Basic Data Quality Assessment:")
print("================================")
basic_quality$text
Expected Results: - Missing data analysis for Age, Sex, and Grade variables - Duplicate value detection within each variable - Data completeness summary - Recommendations based on quality findings
Example 2: Enhanced Visual Data Quality Assessment
Using visdat integration for comprehensive visual analysis.
# Enhanced visual data quality assessment
visual_quality <- dataquality(
data = histopathology,
vars = c("Age", "Sex", "Grade", "TStage", "LVI", "PNI"),
check_duplicates = TRUE,
check_missing = TRUE,
complete_cases_only = FALSE,
visual_analysis = TRUE, # Enable visual analysis
visdat_type = "vis_dat", # Data overview visualization
missing_threshold_visual = 5, # Highlight variables with >5% missing
export_plots = FALSE
)
# View visual quality results
print("👁️ Visual Data Quality Assessment:")
print("=================================")
print("Text Summary:")
visual_quality$text
print("\nVisual Analysis:")
visual_quality$plot
Visual Analysis Benefits: - Pattern Recognition: Identify systematic missing data patterns - Type Validation: Verify variable types match expectations - Missing Clustering: Detect non-random missing data mechanisms - Quality Overview: Comprehensive visual summary of data structure
Example 3: Missing Data Pattern Analysis
Focused analysis of missing data patterns using vis_miss.
# Missing data pattern analysis
missing_analysis <- dataquality(
data = histopathology,
vars = c("Age", "Grade", "TStage", "MeasurementA", "MeasurementB"),
check_duplicates = FALSE, # Focus on missing data
check_missing = TRUE,
complete_cases_only = TRUE, # Analyze complete cases
visual_analysis = TRUE,
visdat_type = "vis_miss", # Missing pattern visualization
missing_threshold_visual = 10,
export_plots = FALSE
)
# Examine missing data insights
print("🔍 Missing Data Pattern Analysis:")
print("=================================")
missing_analysis$text
missing_analysis$plot
Missing Pattern Insights: - MCAR Detection: Identify missing completely at random patterns - MAR Assessment: Detect missing at random relationships - MNAR Warning: Flag potentially missing not at random issues - Pattern Clustering: Visualize missing data relationships
Example 4: Data Type Validation
Using vis_guess for automatic data type detection and validation.
# Data type validation analysis
type_validation <- dataquality(
data = histopathology,
vars = c("Age", "Sex", "Grade", "OverallTime", "Death"),
check_duplicates = FALSE,
check_missing = FALSE,
complete_cases_only = FALSE,
visual_analysis = TRUE,
visdat_type = "vis_guess", # Type detection visualization
missing_threshold_visual = 15,
export_plots = TRUE # Enable export for documentation
)
# Review type validation results
print("🔢 Data Type Validation Analysis:")
print("================================")
type_validation$text
type_validation$plot
Type Validation Applications: - Import Verification: Confirm data imported with correct types - Analysis Readiness: Ensure variables are analysis-ready - Error Detection: Identify unexpected type conversions - Documentation: Create type validation reports
Example 5: Comprehensive Quality Assessment
Complete data quality assessment using all analysis types.
# Comprehensive data quality assessment
comprehensive_quality <- dataquality(
data = histopathology,
vars = c("Age", "Sex", "Grade", "TStage", "LVI", "PNI", "OverallTime", "Death"),
check_duplicates = TRUE,
check_missing = TRUE,
complete_cases_only = TRUE,
visual_analysis = TRUE,
visdat_type = "all_visual", # All visual analyses
missing_threshold_visual = 5,
export_plots = TRUE
)
# Review comprehensive results
print("📊 Comprehensive Data Quality Assessment:")
print("=======================================")
print("Quality Summary:")
comprehensive_quality$text
print("\nVisual Analysis Suite:")
comprehensive_quality$plot
Comprehensive Benefits: - Complete Coverage: All quality dimensions assessed - Integrated Analysis: Statistical and visual assessment combined - Publication Ready: Professional output for research documentation - Regulatory Compliance: Meets clinical research standards
Advanced Configuration Options
Visual Analysis Types
Data Overview (vis_dat)
# Data overview configuration
overview_config <- dataquality(
data = histopathology,
vars = c("Age", "Sex", "Grade", "TStage"),
visual_analysis = TRUE,
visdat_type = "vis_dat",
missing_threshold_visual = 10
)
Use Cases: - Initial data exploration - Variable type verification - Overall data structure assessment - Missing data overview
Missing Patterns (vis_miss)
# Missing pattern configuration
missing_config <- dataquality(
data = histopathology,
vars = c("MeasurementA", "MeasurementB", "OverallTime"),
visual_analysis = TRUE,
visdat_type = "vis_miss",
missing_threshold_visual = 5 # Stricter threshold for missing data
)
Use Cases: - Missing data mechanism investigation - Pattern clustering analysis - Systematic missingness detection - Imputation planning
Type Detection (vis_guess)
# Type detection configuration
type_config <- dataquality(
data = histopathology,
vars = c("Age", "Sex", "Grade", "Death"),
visual_analysis = TRUE,
visdat_type = "vis_guess",
missing_threshold_visual = 15
)
Use Cases: - Data import validation - Type conversion verification - Analysis preparation - Error detection
Value Expectations (vis_expect)
# Value expectation configuration
expect_config <- dataquality(
data = histopathology,
vars = c("Age", "OverallTime"),
visual_analysis = TRUE,
visdat_type = "vis_expect",
missing_threshold_visual = 10
)
Use Cases: - Outlier pattern detection - Expected value validation - Data anomaly identification - Quality threshold assessment
Missing Data Threshold Configuration
Threshold Selection Guidelines
# Conservative threshold (highlight variables with >2% missing)
conservative_analysis <- dataquality(
data = histopathology,
vars = c("Age", "Sex", "Grade"),
visual_analysis = TRUE,
visdat_type = "vis_miss",
missing_threshold_visual = 2 # Very strict
)
# Standard threshold (highlight variables with >10% missing)
standard_analysis <- dataquality(
data = histopathology,
vars = c("Age", "Sex", "Grade"),
visual_analysis = TRUE,
visdat_type = "vis_miss",
missing_threshold_visual = 10 # Recommended
)
# Permissive threshold (highlight variables with >25% missing)
permissive_analysis <- dataquality(
data = histopathology,
vars = c("Age", "Sex", "Grade"),
visual_analysis = TRUE,
visdat_type = "vis_miss",
missing_threshold_visual = 25 # Lenient
)
Threshold Recommendations: - Primary Endpoints: 2-5% (strict quality requirements) - Secondary Variables: 10-15% (standard clinical research) - Exploratory Analysis: 20-25% (hypothesis generation)
Export and Documentation Options
Publication-Ready Plots
# Export-ready configuration
export_analysis <- dataquality(
data = histopathology,
vars = c("Age", "Sex", "Grade", "TStage"),
check_duplicates = TRUE,
check_missing = TRUE,
complete_cases_only = FALSE,
visual_analysis = TRUE,
visdat_type = "all_visual",
missing_threshold_visual = 10,
export_plots = TRUE # Enable export features
)
# Export information will be included in results
export_analysis$plot
Export Features: - High-resolution plot generation ready for manuscripts - Code examples for reproducing analyses - Professional formatting for publications - Integration with visdat export capabilities
Clinical Research Applications
Biomarker Data Quality Assessment
Precision Medicine Requirements
# Create biomarker test data (simulated for demonstration)
biomarker_data <- data.frame(
patient_id = paste0("BM", sprintf("%03d", 1:100)),
her2_expression = c(
# HER2 negative patients
rlnorm(80, meanlog = 1.5, sdlog = 0.8),
# HER2 positive patients
rlnorm(20, meanlog = 3.2, sdlog = 0.6)
),
pdl1_percentage = pmax(0, pmin(100, rnorm(100, 25, 20))), # Bounded 0-100%
response_status = factor(sample(c("Complete Response", "Partial Response", "Stable Disease", "Progressive Disease"),
100, replace = TRUE))
)
# Assess biomarker quality
biomarker_quality <- dataquality(
data = biomarker_data,
vars = c("her2_expression", "pdl1_percentage", "response_status"),
check_duplicates = TRUE,
check_missing = TRUE,
complete_cases_only = FALSE,
visual_analysis = TRUE,
visdat_type = "vis_dat",
missing_threshold_visual = 5, # Strict for biomarkers
export_plots = TRUE
)
print("🧬 Biomarker Data Quality Assessment:")
biomarker_quality$text
biomarker_quality$plot
Biomarker Quality Standards: - Coefficient of Variation: <20% for most assays - Missing Data: <5% for regulatory submissions - Range Validation: Biologically meaningful bounds - Outlier Investigation: Essential for clinical decisions
Clinical Trial Data Monitoring
Real-Time Quality Monitoring
# Simulate ongoing clinical trial data
trial_data <- data.frame(
patient_id = paste0("CT", sprintf("%03d", 1:75)),
enrollment_date = seq(as.Date("2024-01-01"), as.Date("2024-03-15"), length.out = 75),
primary_endpoint = c(
rnorm(60, 2.5, 1.2), # Most patients
rep(NA, 10), # Recent enrollments, not yet assessed
c(8.5, -2.1, 15.3, 0.1, 9.8) # Some outliers
),
safety_score = sample(0:4, 75, replace = TRUE, prob = c(0.4, 0.3, 0.2, 0.08, 0.02)),
treatment_group = factor(sample(c("Treatment", "Control"), 75, replace = TRUE))
)
# Monitor trial data quality
trial_quality <- dataquality(
data = trial_data,
vars = c("primary_endpoint", "safety_score", "treatment_group"),
check_duplicates = FALSE, # Not applicable for trial data
check_missing = TRUE,
complete_cases_only = TRUE,
visual_analysis = TRUE,
visdat_type = "vis_miss", # Focus on missing patterns
missing_threshold_visual = 10,
export_plots = FALSE
)
print("🏥 Clinical Trial Data Quality Monitoring:")
trial_quality$text
trial_quality$plot
# Check if quality meets regulatory standards
missing_pct <- 100 * sum(is.na(trial_data$primary_endpoint)) / nrow(trial_data)
if (missing_pct > 10) {
cat("⚠️ WARNING: Missing data exceeds 10% - investigate data collection procedures\n")
} else {
cat("✅ PASS: Missing data within acceptable limits\n")
}
Pharmacovigilance Data Quality
Adverse Event Reporting Quality
# Simulate adverse event data
ae_data <- data.frame(
patient_id = paste0("AE", sprintf("%03d", 1:150)),
severity_score = sample(1:5, 150, replace = TRUE, prob = c(0.4, 0.3, 0.2, 0.08, 0.02)),
onset_days = c(
sample(1:30, 120, replace = TRUE), # Normal onset times
rep(NA, 20), # Missing onset times
c(365, 400, -5, 0, 2000, 1500, 750, 800, 1200, 950) # Problematic values
),
recovery_status = factor(
sample(c("Recovered", "Recovering", "Not Recovered", "Unknown"),
150, replace = TRUE, prob = c(0.6, 0.2, 0.1, 0.1))
)
)
# Assess adverse event data quality
ae_quality <- dataquality(
data = ae_data,
vars = c("severity_score", "onset_days", "recovery_status"),
check_duplicates = TRUE,
check_missing = TRUE,
complete_cases_only = FALSE,
visual_analysis = TRUE,
visdat_type = "all_visual", # Comprehensive analysis for safety
missing_threshold_visual = 15,
export_plots = TRUE
)
print("⚠️ Pharmacovigilance Data Quality Assessment:")
ae_quality$text
ae_quality$plot
Quality Assurance Best Practices
Data Quality Validation Workflow
Pre-Analysis Quality Check
# Function for standardized pre-analysis quality assessment
perform_preanalysis_quality_check <- function(data, primary_vars, secondary_vars = NULL) {
cat("📋 PRE-ANALYSIS QUALITY ASSESSMENT\n")
cat("=================================\n")
# Primary variables quality assessment
primary_quality <- dataquality(
data = data,
vars = primary_vars,
check_duplicates = TRUE,
check_missing = TRUE,
complete_cases_only = TRUE,
visual_analysis = TRUE,
visdat_type = "all_visual",
missing_threshold_visual = 5, # Strict for primary variables
export_plots = TRUE
)
cat("PRIMARY VARIABLES ASSESSMENT:\n")
cat("----------------------------\n")
primary_quality$text
# Secondary variables if provided
if (!is.null(secondary_vars)) {
secondary_quality <- dataquality(
data = data,
vars = secondary_vars,
check_duplicates = TRUE,
check_missing = TRUE,
complete_cases_only = FALSE,
visual_analysis = TRUE,
visdat_type = "vis_miss",
missing_threshold_visual = 15, # More lenient for secondary
export_plots = FALSE
)
cat("\nSECONDARY VARIABLES ASSESSMENT:\n")
cat("------------------------------\n")
secondary_quality$text
}
return(list(primary = primary_quality, secondary = secondary_quality))
}
# Example usage
primary_vars <- c("Age", "Sex", "Grade", "TStage")
secondary_vars <- c("LVI", "PNI", "MeasurementA", "MeasurementB")
quality_results <- perform_preanalysis_quality_check(
data = histopathology,
primary_vars = primary_vars,
secondary_vars = secondary_vars
)
Quality Grading System
The dataquality function provides implicit quality grading based on analysis results:
Grade A: Excellent Quality - Missing data: 0-5% - Minimal duplicates - Appropriate data types - No systematic patterns
Grade B: Good Quality - Missing data: 5-15% - Some duplicate patterns - Minor type issues - Limited systematic concerns
Grade C: Concerning Quality - Missing data: 15-30% - Notable duplicate issues - Type validation problems - Systematic patterns detected
Grade D: Poor Quality - Missing data: >30% - Significant quality issues - Multiple systematic problems - Requires intervention
Integration with jamovi Workflow
Parameter Configuration in jamovi
- Variables: Select variables for quality assessment
- Duplicates: Enable duplicate detection analysis
- Missing Values: Enable missing data analysis
- Complete Cases: Enable complete case analysis
- Visual Analysis: Enable visdat integration
- Analysis Type: Choose vis_dat, vis_miss, vis_guess, vis_expect, or all_visual
- Missing Threshold: Set percentage threshold for missing data highlighting
- Export Plots: Enable plot export functionality
Troubleshooting and Common Issues
visdat Package Dependencies
Installation Requirements
# Check visdat availability
if (!requireNamespace("visdat", quietly = TRUE)) {
cat("📦 Installing visdat package for enhanced visual analysis...\n")
install.packages("visdat")
}
# Verify installation
if (requireNamespace("visdat", quietly = TRUE)) {
cat("✅ visdat package available - full visual analysis enabled\n")
} else {
cat("⚠️ visdat package not available - visual analysis will show informative messages\n")
cat(" Install with: install.packages('visdat')\n")
}
Alternative Analysis Without visdat
# Data quality assessment without visual analysis
no_visual_quality <- dataquality(
data = histopathology,
vars = c("Age", "Sex", "Grade"),
check_duplicates = TRUE,
check_missing = TRUE,
complete_cases_only = FALSE,
visual_analysis = FALSE, # Disable visual analysis
visdat_type = "vis_dat",
missing_threshold_visual = 10,
export_plots = FALSE
)
print("📊 Data Quality Assessment (No Visual Analysis):")
no_visual_quality$text
Performance Optimization
Large Dataset Handling
# Tips for large datasets (>1000 rows, >50 variables)
# 1. Subset variables for initial assessment
key_vars <- c("Age", "Sex", "Grade", "TStage", "Death")
subset_quality <- dataquality(
data = histopathology,
vars = key_vars, # Focus on key variables first
check_duplicates = FALSE, # Skip duplicates for speed
check_missing = TRUE,
complete_cases_only = FALSE,
visual_analysis = TRUE,
visdat_type = "vis_dat", # Use faster vis_dat over all_visual
missing_threshold_visual = 10,
export_plots = FALSE
)
# 2. Use sampling for very large datasets
if (nrow(histopathology) > 1000) {
sample_data <- histopathology[sample(nrow(histopathology), 500), ]
sample_quality <- dataquality(
data = sample_data,
vars = c("Age", "Sex", "Grade"),
visual_analysis = TRUE,
visdat_type = "all_visual",
missing_threshold_visual = 10
)
}
Memory Management
# Monitor memory usage during quality assessment
cat("Memory usage before analysis:", format(object.size(histopathology), units = "MB"), "\n")
# Perform analysis
quality_result <- dataquality(
data = histopathology,
vars = c("Age", "Sex", "Grade"),
visual_analysis = TRUE,
visdat_type = "vis_dat"
)
# Clean up if needed
# rm(quality_result)
# gc() # Garbage collection
Advanced Integration Examples
Integration with Other ClinicoPath Functions
Quality Assessment Before crosstable Analysis
# Step 1: Assess data quality
quality_check <- dataquality(
data = histopathology,
vars = c("Sex", "Grade", "Group"),
check_duplicates = TRUE,
check_missing = TRUE,
visual_analysis = TRUE,
visdat_type = "vis_miss"
)
print("Quality assessment before crosstable analysis:")
quality_check$text
# Step 2: Proceed with crosstable if quality is acceptable
if (# quality meets standards - implement quality check logic here
TRUE) { # Placeholder condition
crosstable_result <- crosstable(
data = histopathology,
vars = c("Sex", "Grade"),
group = "Group"
)
cat("✅ Quality check passed - proceeding with crosstable analysis\n")
}
Quality Assessment Before Survival Analysis
# Assess survival analysis variables
survival_quality <- dataquality(
data = histopathology,
vars = c("OverallTime", "Death", "Grade", "TStage"),
check_duplicates = FALSE, # Not applicable for survival data
check_missing = TRUE,
complete_cases_only = TRUE, # Critical for survival analysis
visual_analysis = TRUE,
visdat_type = "vis_dat",
missing_threshold_visual = 5 # Strict for survival endpoints
)
print("Survival analysis data quality:")
survival_quality$text
Automated Quality Reporting
Generate Quality Report
# Function to generate automated quality report
generate_quality_report <- function(data, variables, report_title = "Data Quality Report") {
cat("📊", report_title, "\n")
cat(rep("=", nchar(report_title) + 4), "\n")
cat("Generated on:", Sys.Date(), "\n")
cat("Dataset:", deparse(substitute(data)), "\n")
cat("Variables analyzed:", length(variables), "\n")
cat("Observations:", nrow(data), "\n\n")
# Comprehensive quality assessment
quality_result <- dataquality(
data = data,
vars = variables,
check_duplicates = TRUE,
check_missing = TRUE,
complete_cases_only = TRUE,
visual_analysis = TRUE,
visdat_type = "all_visual",
missing_threshold_visual = 10,
export_plots = TRUE
)
cat("QUALITY ASSESSMENT RESULTS:\n")
cat("-------------------------\n")
quality_result$text
cat("\nVISUAL ANALYSIS:\n")
cat("---------------\n")
quality_result$plot
return(quality_result)
}
# Generate report for histopathology data
quality_report <- generate_quality_report(
data = histopathology,
variables = c("Age", "Sex", "Grade", "TStage", "LVI", "PNI"),
report_title = "Histopathology Data Quality Report"
)
Conclusion
The ClinicoPath dataquality
function represents a
significant advancement in clinical research data validation tools. By
integrating sophisticated visual analysis capabilities with traditional
statistical quality assessment, it provides researchers with
unprecedented insight into data quality issues.
Key Benefits
- Comprehensive Analysis: Statistical and visual assessment in one tool
- Clinical Focus: Designed specifically for medical research workflows
- Advanced Visualization: First jamovi integration with visdat package
- Flexible Configuration: Customizable analysis options and thresholds
- Export Capabilities: Publication-ready outputs for research documentation
- Regulatory Compliance: Supports ICH E6 and FDA data quality standards
Success Factors
- Regular Assessment: Integrate quality checks throughout data lifecycle
- Visual Interpretation: Leverage visual patterns for deeper insights
- Threshold Customization: Adjust quality standards for research context
- Documentation Standards: Maintain comprehensive quality audit trails
- Continuous Monitoring: Implement ongoing quality surveillance
Future Enhancements
- Machine Learning Integration: Automated pattern recognition and anomaly detection
- Real-Time Monitoring: Continuous quality assessment during data collection
- Predictive Quality Models: Early warning systems for quality deterioration
- Advanced Export Options: Enhanced publication-ready output formats
- Integration Expansion: Deeper connections with other ClinicoPath modules
The dataquality
function establishes a new standard for
clinical research data validation, combining the power of statistical
analysis with the insight of visual exploration to ensure the highest
quality foundation for medical research and evidence-based healthcare
decisions.
This comprehensive guide demonstrates the full capabilities of enhanced data quality assessment in the ClinicoPath dataquality function, providing users with the theoretical foundation, practical skills, and professional standards needed for effective clinical research data management and quality assurance.