Data Preparation for Survival Analysis
Source:vignettes/general-11-data-preparation.Rmd
general-11-data-preparation.Rmd
Overview
Proper data preparation is crucial for reliable survival analysis. This vignette provides comprehensive guidance on preparing data for use with jsurvival, covering common data issues and their solutions.
Essential Data Components
Time Variable Preparation
Calculating Time from Dates
Using Built-in Time Interval Calculator
The jsurvival module includes a time interval calculator:
# In jamovi: Survival → Data Preparation → Time Interval Calculator
# Assign:
# - Start Date: Study entry date (e.g., diagnosis_date)
# - End Date: Event/censoring date (e.g., death_date or last_followup_date)
# - Output Unit: months or years
# This creates a new variable with calculated time intervals
Manual Calculation in R
library(lubridate)
# Example data with dates
clinical_data <- data.frame(
patient_id = 1:100,
diagnosis_date = as.Date("2020-01-01") + sample(0:365, 100, replace = TRUE),
death_date = as.Date("2020-01-01") + sample(200:1000, 100, replace = TRUE),
last_followup_date = as.Date("2023-12-31")
)
# Calculate time to death or censoring
clinical_data$death_occurred <- !is.na(clinical_data$death_date)
clinical_data$event_date <- ifelse(clinical_data$death_occurred,
clinical_data$death_date,
clinical_data$last_followup_date)
clinical_data$event_date <- as.Date(clinical_data$event_date, origin = "1970-01-01")
# Time in months
clinical_data$time_months <- as.numeric(
difftime(clinical_data$event_date, clinical_data$diagnosis_date, units = "days")
) / 30.44 # Average days per month
# Time in years
clinical_data$time_years <- as.numeric(
difftime(clinical_data$event_date, clinical_data$diagnosis_date, units = "days")
) / 365.25 # Accounting for leap years
Handling Time Zero Issues
Left Truncation
# Patients entering study at different times
# Example: analyzing from landmark time (e.g., 3 months post-diagnosis)
# Include only patients who survived to landmark
landmark_time <- 3 # months
eligible_data <- subset(clinical_data, time_months >= landmark_time)
# Adjust time origin to landmark
eligible_data$time_from_landmark <- eligible_data$time_months - landmark_time
Negative Time Values
# Check for and handle negative times
negative_times <- clinical_data$time_months < 0
if(any(negative_times, na.rm = TRUE)) {
print(paste("Found", sum(negative_times, na.rm = TRUE), "negative time values"))
# Options:
# 1. Exclude these patients
clinical_data <- clinical_data[!negative_times | is.na(negative_times), ]
# 2. Set to small positive value
# clinical_data$time_months[negative_times] <- 0.001
# 3. Investigate data entry errors
# View(clinical_data[negative_times, ])
}
Event Indicator Preparation
Creating Binary Event Variables
Other Events
# Disease progression
clinical_data$progression <- ifelse(clinical_data$progression_status == "Progressive Disease", 1, 0)
# Recurrence
clinical_data$recurrence <- ifelse(!is.na(clinical_data$recurrence_date), 1, 0)
# Composite endpoint (death or progression, whichever comes first)
clinical_data$death_or_progression <- pmax(clinical_data$death_binary,
clinical_data$progression,
na.rm = TRUE)
Multiple Event Types
Competing Risks Setup
# Create separate indicators for different event types
clinical_data$death_cancer <- ifelse(clinical_data$death_cause == "Cancer", 1, 0)
clinical_data$death_other <- ifelse(clinical_data$death_cause == "Other" |
clinical_data$death_cause == "Cardiovascular", 1, 0)
# For patients still alive, both should be 0
alive_patients <- clinical_data$vital_status == "Alive"
clinical_data$death_cancer[alive_patients] <- 0
clinical_data$death_other[alive_patients] <- 0
# Verify: sum should equal total deaths
table(clinical_data$death_cancer + clinical_data$death_other,
clinical_data$vital_status)
Data Quality Checks
Consistency Checks
# Function to perform comprehensive data quality checks
check_survival_data <- function(data, time_var, event_var) {
checks <- list()
# 1. Missing values
checks$missing_time <- sum(is.na(data[[time_var]]))
checks$missing_event <- sum(is.na(data[[event_var]]))
# 2. Negative or zero times
checks$negative_times <- sum(data[[time_var]] <= 0, na.rm = TRUE)
# 3. Event indicator values
event_values <- unique(data[[event_var]])
checks$valid_events <- all(event_values %in% c(0, 1, NA))
checks$event_table <- table(data[[event_var]], useNA = "always")
# 4. Extreme values
checks$extreme_times <- list(
max_time = max(data[[time_var]], na.rm = TRUE),
min_time = min(data[[time_var]], na.rm = TRUE),
outliers = sum(data[[time_var]] > quantile(data[[time_var]], 0.99, na.rm = TRUE), na.rm = TRUE)
)
# 5. Event rate
checks$event_rate <- mean(data[[event_var]], na.rm = TRUE)
return(checks)
}
# Run checks
data_quality <- check_survival_data(clinical_data, "time_months", "death_binary")
print(data_quality)
Outlier Detection
# Identify potential outliers in survival times
Q1 <- quantile(clinical_data$time_months, 0.25, na.rm = TRUE)
Q3 <- quantile(clinical_data$time_months, 0.75, na.rm = TRUE)
IQR <- Q3 - Q1
# Define outliers using 1.5 * IQR rule
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR
outliers <- clinical_data$time_months < lower_bound | clinical_data$time_months > upper_bound
print(paste("Found", sum(outliers, na.rm = TRUE), "potential outliers"))
# Review outliers
if(sum(outliers, na.rm = TRUE) > 0) {
outlier_data <- clinical_data[outliers & !is.na(outliers),
c("patient_id", "time_months", "death_binary")]
print(outlier_data)
}
Handling Missing Data
Missing Data Patterns
# Assess missing data patterns
library(VIM) # Visualization and Imputation of Missing values
# Create missingness indicator
clinical_data$time_missing <- is.na(clinical_data$time_months)
clinical_data$event_missing <- is.na(clinical_data$death_binary)
# Missing data summary
missing_summary <- clinical_data %>%
summarise(
n_total = n(),
time_missing = sum(time_missing),
event_missing = sum(event_missing),
both_missing = sum(time_missing & event_missing),
complete_cases = sum(!time_missing & !event_missing)
)
print(missing_summary)
Strategies for Missing Data
Administrative Censoring
# For missing death dates, censor at last known follow-up
clinical_data$time_months_imputed <- clinical_data$time_months
clinical_data$death_binary_imputed <- clinical_data$death_binary
# If death status missing but last follow-up available, assume alive
missing_death <- is.na(clinical_data$death_binary) & !is.na(clinical_data$last_followup_date)
if(any(missing_death)) {
clinical_data$death_binary_imputed[missing_death] <- 0
clinical_data$time_months_imputed[missing_death] <-
as.numeric(difftime(clinical_data$last_followup_date[missing_death],
clinical_data$diagnosis_date[missing_death],
units = "days")) / 30.44
}
Explanatory Variable Preparation
Categorical Variables
Factor Ordering
# Ensure proper ordering for categorical variables
clinical_data$tumor_grade <- factor(clinical_data$tumor_grade,
levels = c("Grade 1", "Grade 2", "Grade 3", "Grade 4"),
ordered = TRUE)
clinical_data$stage <- factor(clinical_data$stage,
levels = c("Stage I", "Stage II", "Stage III", "Stage IV"),
ordered = TRUE)
# For unordered factors, set reference level
clinical_data$treatment <- factor(clinical_data$treatment)
clinical_data$treatment <- relevel(clinical_data$treatment, ref = "Standard")
Handling Rare Categories
# Combine rare categories to ensure adequate sample sizes
table(clinical_data$histology)
# Combine categories with <10 patients
rare_histologies <- names(table(clinical_data$histology))[table(clinical_data$histology) < 10]
clinical_data$histology_grouped <- as.character(clinical_data$histology)
clinical_data$histology_grouped[clinical_data$histology_grouped %in% rare_histologies] <- "Other"
clinical_data$histology_grouped <- factor(clinical_data$histology_grouped)
table(clinical_data$histology_grouped)
Continuous Variables
Age Preparation
# Check age distribution
summary(clinical_data$age_at_diagnosis)
# Create age groups if needed
clinical_data$age_group <- cut(clinical_data$age_at_diagnosis,
breaks = c(0, 50, 65, 80, Inf),
labels = c("≤50", "51-65", "66-80", ">80"),
include.lowest = TRUE)
# Age as decades
clinical_data$age_decades <- floor(clinical_data$age_at_diagnosis / 10) * 10
Biomarker Preparation
# Log transformation for skewed biomarkers
clinical_data$log_biomarker <- log10(clinical_data$biomarker_value + 1)
# Standardization (z-score)
clinical_data$biomarker_zscore <- scale(clinical_data$biomarker_value)[,1]
# Handle extreme values
biomarker_99th <- quantile(clinical_data$biomarker_value, 0.99, na.rm = TRUE)
clinical_data$biomarker_winsorized <- pmin(clinical_data$biomarker_value, biomarker_99th)
Data Validation
Cross-Validation with Source Data
# Compare calculated times with expected ranges
validation_checks <- function(data) {
issues <- list()
# Check if all events occurred before end of study
study_end <- as.Date("2023-12-31")
late_events <- data$event_date > study_end & data$death_binary == 1
issues$late_events <- sum(late_events, na.rm = TRUE)
# Check if follow-up times are reasonable
unreasonable_followup <- data$time_months > 240 # > 20 years
issues$long_followup <- sum(unreasonable_followup, na.rm = TRUE)
# Check for impossible age at events
impossible_age <- data$age_at_diagnosis < 0 | data$age_at_diagnosis > 120
issues$impossible_age <- sum(impossible_age, na.rm = TRUE)
return(issues)
}
validation_results <- validation_checks(clinical_data)
print(validation_results)
Final Dataset Preparation
# Create final analysis dataset
analysis_data <- clinical_data %>%
# Include only complete cases for primary analysis
filter(!is.na(time_months), !is.na(death_binary)) %>%
# Remove patients with negative or zero follow-up time
filter(time_months > 0) %>%
# Remove patients with missing key covariates
filter(!is.na(age_at_diagnosis), !is.na(stage)) %>%
# Select relevant variables
select(
patient_id,
time_months,
death_binary,
age_at_diagnosis,
age_group,
sex,
stage,
tumor_grade,
treatment,
biomarker_value,
histology_grouped
)
# Final data summary
cat("Final analysis dataset:\n")
cat("N =", nrow(analysis_data), "\n")
cat("Events =", sum(analysis_data$death_binary), "\n")
cat("Event rate =", round(mean(analysis_data$death_binary) * 100, 1), "%\n")
cat("Median follow-up =", round(median(analysis_data$time_months), 1), "months\n")
Export for jamovi
CSV Export
# Export cleaned data for use in jamovi
write.csv(analysis_data, "survival_analysis_data.csv", row.names = FALSE)
# Include data dictionary
data_dictionary <- data.frame(
variable = names(analysis_data),
description = c(
"Patient identifier",
"Time from diagnosis to death or censoring (months)",
"Death indicator (1=death, 0=alive/censored)",
"Age at diagnosis (years)",
"Age group categories",
"Patient sex",
"Cancer stage",
"Tumor grade",
"Treatment received",
"Biomarker value",
"Histology type (grouped)"
),
type = c(
"Identifier",
"Continuous",
"Binary",
"Continuous",
"Categorical",
"Categorical",
"Ordinal",
"Ordinal",
"Categorical",
"Continuous",
"Categorical"
)
)
write.csv(data_dictionary, "data_dictionary.csv", row.names = FALSE)
Quality Assurance Checklist
Common Data Preparation Mistakes
- Incorrect time calculation: Using calendar time instead of time from study entry
- Wrong event coding: Mixing up 0 and 1 for events and censoring
- Including impossible values: Negative times, events after last follow-up
- Ignoring competing risks: Treating all deaths as same event type
- Improper handling of missing data: Excluding too many patients without justification
- Not checking assumptions: Ignoring data distribution and outliers
Conclusion
Proper data preparation is essential for reliable survival analysis results. Key principles include:
- Careful time calculation with appropriate handling of dates and time zero
- Consistent event coding with validation against source data
- Comprehensive quality checks to identify and resolve data issues
- Appropriate handling of missing data based on missingness patterns
- Proper preparation of explanatory variables with attention to clinical meaning
Taking time for thorough data preparation will ensure more reliable and interpretable survival analysis results in jsurvival.