
Comprehensive collection of test datasets for the timeinterval function, covering various date formats, clinical scenarios, and edge cases for robust testing of time interval calculations.

Clinical trial data with YYYY-MM-DD date format, treatment groups, and realistic enrollment/follow-up patterns for testing basic time interval calculations.

European clinical data with DD/MM/YYYY date format for testing international date format compatibility and disease progression scenarios.

Hospital admission/discharge data with MM/DD/YYYY HH:MM:SS datetime format for testing high-precision time calculations and data quality issues.

Population-based cohort study data with various exit reasons and extreme values for testing comprehensive data quality assessment features.

Cancer study dataset specifically designed for testing landmark analysis functionality with a 6-month landmark time point.

Deliberately challenging dataset with mixed date formats in the same columns for testing automatic date format detection capabilities.

Summary information for all timeinterval test datasets including observation counts, descriptions, and key features.

Documentation of specific test scenarios, recommended datasets, and expected results for comprehensive timeinterval testing.

Usage

timeinterval_clinical_trial

timeinterval_european_dates

timeinterval_us_datetime

timeinterval_epidemiological

timeinterval_landmark

timeinterval_mixed_formats

timeinterval_datasets_summary

timeinterval_test_scenarios

Format

Various data frames with different structures optimized for specific testing scenarios

A data frame (timeinterval_clinical_trial) with 200 observations and 8 variables:

patient_id

Character. Unique patient identifier (CT_001 to CT_200)

treatment_group

Character. Treatment assignment: "Treatment A", "Treatment B", "Control"

age

Numeric. Patient age at enrollment (mean=65, sd=12)

sex

Character. Patient sex: "Male", "Female"

enrollment_date_ymd

Character. Study enrollment date in YYYY-MM-DD format

followup_date_ymd

Character. Last follow-up date in YYYY-MM-DD format (some missing)

event_occurred

Numeric. Binary indicator of primary event (0/1)

site_location

Character. Study site: "Site A", "Site B", "Site C"

A data frame (timeinterval_european_dates) with 150 observations and 7 variables:

study_id

Character. Unique study identifier (EU_001 to EU_150)

country

Character. European country: "Germany", "France", "Italy", "Spain", "UK"

diagnosis_date_dmy

Character. Diagnosis date in DD/MM/YYYY format

last_visit_dmy

Character. Last clinical visit in DD/MM/YYYY format

disease_stage

Character. Disease stage: "I", "II", "III", "IV"

outcome_status

Character. Patient status: "Alive", "Deceased", "Lost to Follow-up"

comorbidity_score

Numeric. Comorbidity burden score (0-10)

A data frame (timeinterval_us_datetime) with 180 observations and 8 variables:

record_id

Character. Unique record identifier (US_0001 to US_0180)

hospital_unit

Character. Hospital unit: "ICU", "Emergency", "Surgery", "Medical", "Pediatric"

admission_datetime

Character. Admission date/time in MM/DD/YYYY HH:MM:SS format

discharge_datetime

Character. Discharge date/time in MM/DD/YYYY HH:MM:SS format

primary_diagnosis

Character. Primary diagnosis category

severity_score

Numeric. Illness severity score

readmission_30d

Numeric. 30-day readmission indicator (0/1)

insurance_type

Character. Insurance type: "Private", "Medicare", "Medicaid", "Uninsured"

A data frame (timeinterval_epidemiological) with 250 observations and 9 variables:

participant_id

Character. Unique participant identifier (EPI_0001 to EPI_0250)

geographic_region

Character. Region: "Urban", "Suburban", "Rural"

cohort_entry_date

Character. Cohort entry date in YYYY-MM-DD format

exit_date

Character. Study exit date (some missing for testing)

exit_reason

Character. Exit reason: "event", "censored", "death", "emigration"

age_at_entry

Numeric. Age at cohort entry (mean=45, sd=15)

exposure_status

Character. Exposure level: "High", "Medium", "Low", "None"

socioeconomic_status

Character. SES: "High", "Medium", "Low"

baseline_health_score

Numeric. Health score at baseline (mean=75, sd=12)

A data frame (timeinterval_landmark) with 120 observations and 8 variables:

patient_id

Character. Unique patient identifier (LM_001 to LM_120)

cancer_type

Character. Cancer type: "Breast", "Lung", "Colorectal", "Prostate", "Lymphoma"

diagnosis_date

Character. Diagnosis date (all same date for landmark testing)

last_contact_date

Character. Last contact date

vital_status

Character. Status: "Deceased", "Alive"

treatment_received

Character. Treatment type: "Surgery Only", "Surgery + Chemo", etc.

response_6m

Character. 6-month response: "Complete", "Partial", "Stable", "Progressive"

landmark_eligible

Logical. TRUE if survived past 6-month landmark

A data frame (timeinterval_mixed_formats) with 100 observations and 7 variables:

sample_id

Character. Unique sample identifier (MX_001 to MX_100)

data_source

Character. Data source: "Manual Entry", "Electronic Import", etc.

start_date_mixed

Character. Start dates in MIXED formats (YMD, DMY, MDY, YDM)

end_date_mixed

Character. End dates in MIXED formats (matching start format)

data_quality_flag

Character. Quality flag: "High", "Medium", "Low"

operator_id

Character. Data entry operator identifier

verification_status

Character. Status: "Verified", "Pending", "Flagged"

A data frame (timeinterval_datasets_summary) with 6 observations and 4 variables:

Dataset

Character. Dataset name

Observations

Numeric. Number of observations in dataset

Description

Character. Brief dataset description

Key_Features

Character. Key testing features

A data frame (timeinterval_test_scenarios) with 10 observations and 3 variables:

Scenario

Character. Test scenario name

Dataset

Character. Recommended dataset for testing

Expected_Result

Character. Expected outcome description

Source

Generated by ClinicoPath development team for comprehensive function testing

Details

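This collection includes six specialized datasets, plus two reference tables (timeinterval_datasets_summary and timeinterval_test_scenarios). Together the datasets test the following aspects of the timeinterval function: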

  • Multiple date format parsing (YMD, DMY, MDY, datetime)

  • Clinical trial scenarios with treatment groups

  • International date format compatibility

  • High-precision datetime calculations

  • Epidemiological cohort studies

  • Landmark analysis test cases

  • Data quality assessment scenarios

  • Missing value and edge case handling

This dataset (timeinterval_clinical_trial) simulates a multicenter clinical trial with:

  • Staggered enrollment over 1 year (2020-2021)

  • Realistic follow-up periods (30-500 days, Poisson distributed)

  • ~5% missing follow-up dates for robustness testing

  • Event rate of ~25% across all groups

  • Three treatment arms with different allocation probabilities

Ideal for testing:

  • Basic YMD date parsing

  • Missing value handling

  • Treatment group stratification

  • Clinical trial time-to-event analysis
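The basic YMD case with a missing follow-up date can be sketched in base R as follows. This is only an illustration of the arithmetic, not the internals of timeinterval: the dates are toy values rather than rows from the dataset, and converting days to months with an average month length of 30.4375 days is an assumption.

```r
# Toy enrollment/follow-up dates (illustrative values, not taken from the dataset)
enrollment <- c("2020-01-15", "2020-06-01", "2020-11-30")
followup   <- c("2020-07-15", NA,           "2021-11-30")

# as.Date() parses YYYY-MM-DD by default; a missing string stays NA
start <- as.Date(enrollment)
end   <- as.Date(followup)

# Interval in days; NA propagates for the missing follow-up date
days <- as.numeric(end - start)

# Convert to months using an average month length (assumption)
months <- days / 30.4375

data.frame(enrollment, followup, days, months = round(months, 1))
```

The second row keeps NA in both interval columns, which is the behaviour a missing-value test would look for.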

This dataset (timeinterval_european_dates) represents a European multicenter study with:

  • DD/MM/YYYY date format (European standard)

  • Disease staging data for oncology analysis

  • Follow-up periods ranging from 6 months to 3 years

  • Realistic outcome distributions

  • Country-specific enrollment patterns

Ideal for testing:

  • DMY date format parsing

  • International date standards

  • Disease progression analysis

  • Multi-country study coordination
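To show why the DMY format needs explicit handling, here is a small base R sketch. It is independent of how timeinterval parses dates, and the strings are toy values rather than rows from the dataset.

```r
# Toy DD/MM/YYYY strings (illustrative values, not taken from the dataset)
diagnosis  <- c("03/04/2021", "15/01/2020")
last_visit <- c("03/10/2021", "15/01/2022")

# An explicit format string reads 03/04/2021 as 3 April, not 4 March
dx <- as.Date(diagnosis,  format = "%d/%m/%Y")
fu <- as.Date(last_visit, format = "%d/%m/%Y")

# Follow-up in days
followup_days <- as.numeric(fu - dx)

# Parsing the same strings as MM/DD/YYYY gives a different date for the
# first entry and NA for the second (there is no month 15)
as_mdy <- as.Date(diagnosis, format = "%m/%d/%Y")

data.frame(diagnosis, dmy = dx, mdy = as_mdy, followup_days)
```

Entries whose day is 12 or lower parse under both formats without error, so a wrong format choice can silently shift dates instead of failing.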

This dataset (timeinterval_us_datetime) simulates hospital electronic health record data with:

  • High-precision datetime stamps (hours/minutes/seconds)

  • Length of stay ranging from 2 hours to several days

  • Realistic hospital unit distributions

  • ~5 intentional negative intervals (discharge before admission) for quality testing

  • Various diagnosis categories and severity levels

Ideal for testing:

  • MDY datetime format parsing

  • High-precision time calculations

  • Negative interval detection

  • Hospital length of stay analysis

  • Data quality assessment features
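Length of stay and negative-interval detection can be sketched in base R like this. The timestamps are toy values, the UTC time zone is an assumption made to avoid daylight-saving effects, and none of this reflects how timeinterval itself is implemented.

```r
# Toy MM/DD/YYYY HH:MM:SS timestamps (illustrative values, not taken from the dataset)
admission <- c("03/15/2022 08:30:00", "03/16/2022 22:00:00", "03/20/2022 10:00:00")
discharge <- c("03/15/2022 14:45:00", "03/18/2022 09:30:00", "03/19/2022 10:00:00")

# Parse with an explicit format; UTC is assumed
fmt <- "%m/%d/%Y %H:%M:%S"
adm <- as.POSIXct(admission, format = fmt, tz = "UTC")
dis <- as.POSIXct(discharge, format = fmt, tz = "UTC")

# Length of stay in hours, keeping sub-day precision
los_hours <- as.numeric(difftime(dis, adm, units = "hours"))

# A discharge recorded before admission yields a negative interval
negative <- los_hours < 0

data.frame(admission, discharge, los_hours, negative)
```

The third record is flagged because its discharge precedes its admission by a full day.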

This dataset (timeinterval_epidemiological) represents a population-based epidemiological study with:

  • Follow-up between 2018 and 2023 (up to 6 years)

  • Multiple exit scenarios (15% events, 65% censored, 15% death, 5% emigration)

  • ~8 extreme follow-up values (10-20 years) for outlier testing

  • ~4 missing exit dates for missing data testing

  • Realistic demographic and exposure distributions

Ideal for testing:

  • Long-term follow-up calculations

  • Extreme value detection

  • Missing data handling

  • Population health analysis

  • Comprehensive data quality assessment
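One simple way to separate missing, plausible, and extreme follow-up values is sketched below in base R. The dates are toy values, and the 6-year cut-off is an assumption taken from the study window described above. The rule that remove_extreme applies inside timeinterval may differ.

```r
# Toy cohort entry/exit dates (illustrative values, not taken from the dataset)
entry <- c("2018-02-01", "2019-05-10", "2018-09-20", "2020-01-05")
exit  <- c("2021-02-01", NA,           "2033-09-20", "2023-01-05")

# Follow-up in years, using 365.25 days per year (assumption)
years <- as.numeric(as.Date(exit) - as.Date(entry)) / 365.25

# Assumed cut-off: follow-up longer than the 6-year study window is implausible
max_years <- 6

status <- ifelse(is.na(years), "missing",
          ifelse(years > max_years, "extreme", "ok"))

data.frame(entry, exit, years = round(years, 2), status)
```

A fixed cut-off is easy to reason about in a test; a distribution-based rule (for example one based on the interquartile range) would be a reasonable alternative when no study window is known.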

This dataset (timeinterval_landmark) is specifically designed for landmark analysis testing with:

  • Common diagnosis date for all patients (2020-03-01)

  • ~30% of patients with events before 6-month landmark (to be excluded)

  • ~70% surviving past landmark with varying additional follow-up

  • 5 patients with exactly 6-month follow-up for boundary testing

  • Realistic cancer treatment and response patterns

Ideal for testing:

  • Landmark analysis at 6 months

  • Conditional survival calculations

  • Patient exclusion criteria

  • Cancer survival analysis

  • Treatment response correlation
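The exclusion and time-shift logic of a landmark analysis can be sketched in base R as below. The last-contact dates are toy values, the month conversion uses an assumed average month length, and treating follow-up of exactly the landmark time as eligible (`>=`) is also an assumption; timeinterval may handle the boundary differently.

```r
# Common diagnosis date, as described for this dataset
diagnosis <- as.Date("2020-03-01")

# Toy last-contact dates (illustrative values, not taken from the dataset)
last_contact <- as.Date(c("2020-06-15", "2020-09-01", "2021-03-01", "2022-01-10"))

landmark_months <- 6

# Follow-up in months using an average month length (assumption)
months <- as.numeric(last_contact - diagnosis) / 30.4375

# Patients whose follow-up ends before the landmark are excluded
eligible <- months >= landmark_months

# For eligible patients, time is re-measured from the landmark
time_from_landmark <- ifelse(eligible, months - landmark_months, NA)

data.frame(last_contact, months = round(months, 2), eligible,
           time_from_landmark = round(time_from_landmark, 2))
```

Only the first patient is excluded; the remaining three keep their follow-up beyond the landmark.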

This dataset (timeinterval_mixed_formats) intentionally mixes date formats within the same column:

  • 25% YYYY-MM-DD format

  • 25% DD/MM/YYYY format

  • 25% MM/DD/YYYY format

  • 25% YYYY/DD/MM format

  • First 3 entries contain obviously invalid dates for error testing

  • Simulates real-world data integration challenges

Ideal for testing:

  • Automatic date format detection

  • Mixed format handling

  • Error detection and reporting

  • Data quality assessment

  • Robust parsing algorithms
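A naive per-element fallback parser for the four formats listed above might look like the following base R sketch. The helper name `parse_mixed` is hypothetical and the strings are toy values. The detection that timeinterval uses with time_format = "auto" is not documented here and may work differently.

```r
# Hypothetical helper: try each format in turn and keep the first successful parse
parse_mixed <- function(x, formats = c("%Y-%m-%d", "%d/%m/%Y",
                                       "%m/%d/%Y", "%Y/%d/%m")) {
  out <- rep(as.Date(NA), length(x))
  for (fmt in formats) {
    todo <- is.na(out)                        # entries not parsed yet
    out[todo] <- as.Date(x[todo], format = fmt)
  }
  out
}

# Toy strings (illustrative values, not taken from the dataset)
x <- c("2021-03-15", "15/03/2021", "03/15/2021", "2021/15/03",
       "99/99/9999", "03/04/2021")

parsed <- parse_mixed(x)
data.frame(x, parsed, failed = is.na(parsed))
```

The first four strings resolve to the same date, and the invalid entry stays NA so it can be reported. The last string is ambiguous: the format order decides that it is read as DD/MM/YYYY (3 April) and not MM/DD/YYYY, which is the main weakness of order-based detection.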

Examples

if (FALSE) { # \dontrun{
# Load the dataset
data(timeinterval_clinical_trial)

# Basic time interval calculation
timeinterval(
  data = timeinterval_clinical_trial,
  dx_date = "enrollment_date_ymd",
  fu_date = "followup_date_ymd",
  time_format = "ymd",
  output_unit = "months"
)

# Treatment group analysis
library(dplyr)
timeinterval_clinical_trial %>%
  group_by(treatment_group) %>%
  summarise(
    n = n(),
    events = sum(event_occurred, na.rm = TRUE)
  )
} # }
if (FALSE) { # \dontrun{
# Load the dataset
data(timeinterval_european_dates)

# European date format analysis
timeinterval(
  data = timeinterval_european_dates,
  dx_date = "diagnosis_date_dmy",
  fu_date = "last_visit_dmy",
  time_format = "dmy",
  output_unit = "months"
)

# Disease stage analysis
library(dplyr)
timeinterval_european_dates %>%
  group_by(disease_stage) %>%
  summarise(
    n = n(),
    deceased = sum(outcome_status == "Deceased", na.rm = TRUE)
  )
} # }
if (FALSE) { # \dontrun{
# Load the dataset
data(timeinterval_us_datetime)

# High-precision datetime analysis
timeinterval(
  data = timeinterval_us_datetime,
  dx_date = "admission_datetime",
  fu_date = "discharge_datetime",
  time_format = "mdy",  # MDY order; the time component is expected to be auto-detected
  output_unit = "days",
  include_quality_metrics = TRUE
)

# Hospital unit analysis
library(dplyr)
timeinterval_us_datetime %>%
  group_by(hospital_unit) %>%
  summarise(
    n = n(),
    readmissions = sum(readmission_30d, na.rm = TRUE)
  )
} # }
if (FALSE) { # \dontrun{
# Load the dataset
data(timeinterval_epidemiological)

# Epidemiological analysis with quality assessment
timeinterval(
  data = timeinterval_epidemiological,
  dx_date = "cohort_entry_date",
  fu_date = "exit_date",
  time_format = "ymd",
  output_unit = "years",
  include_quality_metrics = TRUE,
  remove_extreme = TRUE
)

# Exposure analysis
library(dplyr)
timeinterval_epidemiological %>%
  group_by(exposure_status, exit_reason) %>%
  summarise(n = n(), .groups = "drop")
} # }
if (FALSE) { # \dontrun{
# Load the dataset
data(timeinterval_landmark)

# Landmark analysis at 6 months
timeinterval(
  data = timeinterval_landmark,
  dx_date = "diagnosis_date",
  fu_date = "last_contact_date",
  time_format = "ymd",
  output_unit = "months",
  use_landmark = TRUE,
  landmark_time = 6
)

# Treatment response analysis
library(dplyr)
timeinterval_landmark %>%
  filter(landmark_eligible) %>%
  group_by(response_6m) %>%
  summarise(
    n = n(),
    alive = sum(vital_status == "Alive", na.rm = TRUE)
  )
} # }
if (FALSE) { # \dontrun{
# Load the dataset
data(timeinterval_mixed_formats)

# Test automatic format detection
timeinterval(
  data = timeinterval_mixed_formats,
  dx_date = "start_date_mixed",
  fu_date = "end_date_mixed",
  time_format = "auto",  # Test auto-detection
  output_unit = "days",
  include_quality_metrics = TRUE
)

# Data quality analysis
library(dplyr)
timeinterval_mixed_formats %>%
  group_by(data_quality_flag) %>%
  summarise(
    n = n(),
    verified = sum(verification_status == "Verified", na.rm = TRUE)
  )
} # }