Comprehensive Descriptive Statistics with jsummarytools
Professional Data Summaries using summarytools Integration
ClinicoPath
last-modified
Source:vignettes/jjstatsplot-33-jsummarytools-comprehensive.Rmd
jjstatsplot-33-jsummarytools-comprehensive.Rmd
Introduction
The jsummarytools
function provides comprehensive
descriptive statistics using the powerful summarytools
package. This function is designed specifically for clinical
researchers, epidemiologists, and data analysts who need
professional-quality descriptive statistics with publication-ready
output.
Key Features
- Four Analysis Types: Data frame summaries (dfSummary), frequency tables (freq), descriptive statistics (descr), and cross-tabulations (ctable)
- Professional Output: HTML-formatted tables and summaries ready for reports and publications
- Comprehensive Statistics: Complete descriptive statistics with missing data reporting
- Grouping Support: Stratified analyses by grouping variables
- Performance Optimized: Intelligent caching system for faster repeated analyses
- Flexible Customization: Extensive options for formatting and statistical content
Installation and Setup
# Load required libraries
library(ClinicoPath)
## Warning: replacing previous import 'dplyr::as_data_frame' by
## 'igraph::as_data_frame' when loading 'ClinicoPath'
## Warning: replacing previous import 'DiagrammeR::count_automorphisms' by
## 'igraph::count_automorphisms' when loading 'ClinicoPath'
## Warning: replacing previous import 'dplyr::groups' by 'igraph::groups' when
## loading 'ClinicoPath'
## Warning: replacing previous import 'DiagrammeR::get_edge_ids' by
## 'igraph::get_edge_ids' when loading 'ClinicoPath'
## Warning: replacing previous import 'dplyr::union' by 'igraph::union' when
## loading 'ClinicoPath'
## Warning: replacing previous import 'dplyr::select' by 'jmvcore::select' when
## loading 'ClinicoPath'
## Warning: replacing previous import 'igraph::union' by 'lubridate::union' when
## loading 'ClinicoPath'
## Warning: replacing previous import 'igraph::%--%' by 'lubridate::%--%' when
## loading 'ClinicoPath'
## Warning: replacing previous import 'cutpointr::tnr' by 'mlr3measures::tnr' when
## loading 'ClinicoPath'
## Warning: replacing previous import 'cutpointr::precision' by
## 'mlr3measures::precision' when loading 'ClinicoPath'
## Warning: replacing previous import 'cutpointr::tn' by 'mlr3measures::tn' when
## loading 'ClinicoPath'
## Warning: replacing previous import 'cutpointr::fnr' by 'mlr3measures::fnr' when
## loading 'ClinicoPath'
## Warning: replacing previous import 'cutpointr::tp' by 'mlr3measures::tp' when
## loading 'ClinicoPath'
## Warning: replacing previous import 'cutpointr::npv' by 'mlr3measures::npv' when
## loading 'ClinicoPath'
## Warning: replacing previous import 'cutpointr::ppv' by 'mlr3measures::ppv' when
## loading 'ClinicoPath'
## Warning: replacing previous import 'cutpointr::auc' by 'mlr3measures::auc' when
## loading 'ClinicoPath'
## Warning: replacing previous import 'cutpointr::tpr' by 'mlr3measures::tpr' when
## loading 'ClinicoPath'
## Warning: replacing previous import 'cutpointr::fn' by 'mlr3measures::fn' when
## loading 'ClinicoPath'
## Warning: replacing previous import 'cutpointr::fp' by 'mlr3measures::fp' when
## loading 'ClinicoPath'
## Warning: replacing previous import 'cutpointr::fpr' by 'mlr3measures::fpr' when
## loading 'ClinicoPath'
## Warning: replacing previous import 'cutpointr::recall' by
## 'mlr3measures::recall' when loading 'ClinicoPath'
## Warning: replacing previous import 'cutpointr::specificity' by
## 'mlr3measures::specificity' when loading 'ClinicoPath'
## Warning: replacing previous import 'cutpointr::sensitivity' by
## 'mlr3measures::sensitivity' when loading 'ClinicoPath'
## Warning: replacing previous import 'igraph::as_data_frame' by
## 'tibble::as_data_frame' when loading 'ClinicoPath'
## Warning: replacing previous import 'igraph::crossing' by 'tidyr::crossing' when
## loading 'ClinicoPath'
## Warning: replacing previous import 'magrittr::extract' by 'tidyr::extract' when
## loading 'ClinicoPath'
## Warning: replacing previous import 'mlr3measures::sensitivity' by
## 'caret::sensitivity' when loading 'ClinicoPath'
## Warning: replacing previous import 'mlr3measures::specificity' by
## 'caret::specificity' when loading 'ClinicoPath'
## Registered S3 methods overwritten by 'useful':
## method from
## autoplot.acf ggfortify
## fortify.acf ggfortify
## fortify.kmeans ggfortify
## fortify.ts ggfortify
## Warning: replacing previous import 'jmvcore::select' by 'dplyr::select' when
## loading 'ClinicoPath'
## Registered S3 methods overwritten by 'ggpp':
## method from
## heightDetails.titleGrob ggplot2
## widthDetails.titleGrob ggplot2
## Warning: replacing previous import 'DataExplorer::plot_histogram' by
## 'grafify::plot_histogram' when loading 'ClinicoPath'
## Warning: replacing previous import 'dplyr::select' by 'jmvcore::select' when
## loading 'ClinicoPath'
## Warning: replacing previous import 'mlr3measures::auc' by 'pROC::auc' when
## loading 'ClinicoPath'
## Warning: replacing previous import 'cutpointr::roc' by 'pROC::roc' when loading
## 'ClinicoPath'
## Warning: replacing previous import 'tibble::view' by 'summarytools::view' when
## loading 'ClinicoPath'
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(knitr)
# Set options for better output
options(digits = 3)
knitr::opts_chunk$set(
fig.width = 12,
fig.height = 8,
dpi = 300,
echo = TRUE,
eval = FALSE,
out.width = "100%"
)
# Check if summarytools is available
if (!requireNamespace("summarytools", quietly = TRUE)) {
message("Note: summarytools package not available. Install with: install.packages('summarytools')")
}
Data Preparation
Let’s create realistic datasets for different types of descriptive analyses:
Clinical Research Data
# Create comprehensive clinical dataset
set.seed(123)
clinical_data <- data.frame(
# Patient identifiers
patient_id = paste0("PT_", sprintf("%03d", 1:300)),
# Demographics
age = round(rnorm(300, 65, 12)),
gender = factor(sample(c("Male", "Female"), 300, replace = TRUE)),
race = factor(sample(c("White", "Black", "Hispanic", "Asian", "Other"),
300, replace = TRUE, prob = c(0.65, 0.15, 0.12, 0.06, 0.02))),
# Clinical measurements
height_cm = round(rnorm(300, 170, 12)),
weight_kg = round(rnorm(300, 75, 15)),
systolic_bp = round(rnorm(300, 140, 25)),
diastolic_bp = round(rnorm(300, 85, 15)),
heart_rate = round(rnorm(300, 75, 12)),
# Laboratory values
glucose_mg_dl = round(rnorm(300, 110, 30)),
cholesterol_mg_dl = round(rnorm(300, 200, 40)),
creatinine_mg_dl = round(rnorm(300, 1.0, 0.3), 2),
hemoglobin_g_dl = round(rnorm(300, 13.5, 2.0), 1),
# Medical history
diabetes = factor(sample(c("No", "Yes"), 300, replace = TRUE, prob = c(0.7, 0.3))),
hypertension = factor(sample(c("No", "Yes"), 300, replace = TRUE, prob = c(0.6, 0.4))),
smoking_status = factor(sample(c("Never", "Former", "Current"),
300, replace = TRUE, prob = c(0.5, 0.35, 0.15))),
# Treatment information
treatment_group = factor(sample(c("Control", "Treatment_A", "Treatment_B"),
300, replace = TRUE)),
study_site = factor(sample(c("Site_1", "Site_2", "Site_3"), 300, replace = TRUE)),
# Outcomes
treatment_response = factor(sample(c("No Response", "Partial", "Complete"),
300, replace = TRUE, prob = c(0.3, 0.5, 0.2))),
quality_of_life = round(rnorm(300, 70, 18)),
adverse_events = rpois(300, 1.2)
) %>%
mutate(
# Calculate BMI
bmi = round(weight_kg / (height_cm/100)^2, 1),
# Create realistic medical relationships
systolic_bp = pmax(90, systolic_bp + 0.3 * age +
ifelse(diabetes == "Yes", 15, 0) +
ifelse(smoking_status == "Current", 10, 0)),
glucose_mg_dl = pmax(70, glucose_mg_dl +
ifelse(diabetes == "Yes", 50, 0) +
0.1 * age),
cholesterol_mg_dl = pmax(120, cholesterol_mg_dl + 0.5 * age + 0.8 * bmi),
# Blood pressure category
bp_category = factor(case_when(
systolic_bp < 120 ~ "Normal",
systolic_bp < 130 ~ "Elevated",
systolic_bp < 140 ~ "Stage 1",
TRUE ~ "Stage 2"
), levels = c("Normal", "Elevated", "Stage 1", "Stage 2")),
# BMI category
bmi_category = factor(case_when(
bmi < 18.5 ~ "Underweight",
bmi < 25 ~ "Normal",
bmi < 30 ~ "Overweight",
TRUE ~ "Obese"
), levels = c("Underweight", "Normal", "Overweight", "Obese"))
)
# Display data structure
str(clinical_data)
Survey Research Data
# Create survey/questionnaire dataset
survey_data <- data.frame(
respondent_id = 1:500,
# Demographics
age_group = factor(sample(c("18-25", "26-35", "36-50", "51-65", "65+"),
500, replace = TRUE, prob = c(0.15, 0.25, 0.3, 0.2, 0.1))),
gender = factor(sample(c("Male", "Female", "Non-binary"),
500, replace = TRUE, prob = c(0.48, 0.5, 0.02))),
education = factor(sample(c("High School", "Some College", "Bachelor", "Graduate"),
500, replace = TRUE, prob = c(0.25, 0.25, 0.35, 0.15))),
income_bracket = factor(sample(c("Under $30k", "$30k-$60k", "$60k-$100k", "Over $100k"),
500, replace = TRUE, prob = c(0.25, 0.35, 0.25, 0.15))),
employment = factor(sample(c("Full-time", "Part-time", "Unemployed", "Student", "Retired"),
500, replace = TRUE, prob = c(0.55, 0.15, 0.1, 0.1, 0.1))),
# Likert scale responses (1-5: Strongly Disagree to Strongly Agree)
satisfaction_service = sample(1:5, 500, replace = TRUE, prob = c(0.05, 0.1, 0.2, 0.45, 0.2)),
satisfaction_price = sample(1:5, 500, replace = TRUE, prob = c(0.1, 0.15, 0.25, 0.35, 0.15)),
satisfaction_quality = sample(1:5, 500, replace = TRUE, prob = c(0.03, 0.07, 0.15, 0.5, 0.25)),
# 7-point agreement scales
recommend_likelihood = sample(1:7, 500, replace = TRUE,
prob = c(0.05, 0.05, 0.1, 0.15, 0.2, 0.25, 0.2)),
value_perception = sample(1:7, 500, replace = TRUE,
prob = c(0.08, 0.07, 0.12, 0.18, 0.25, 0.2, 0.1)),
# Frequency responses
usage_frequency = factor(sample(c("Never", "Rarely", "Sometimes", "Often", "Always"),
500, replace = TRUE, prob = c(0.05, 0.15, 0.3, 0.35, 0.15))),
# Binary choices
will_recommend = factor(sample(c("No", "Yes"), 500, replace = TRUE, prob = c(0.25, 0.75))),
return_customer = factor(sample(c("No", "Yes"), 500, replace = TRUE, prob = c(0.3, 0.7))),
# Continuous measures
years_customer = round(runif(500, 0, 10), 1),
monthly_spending = round(exp(rnorm(500, 4.5, 0.8)), 2),
overall_rating = round(rnorm(500, 4.2, 0.8), 1)
) %>%
mutate(
# Constrain ratings to valid ranges
overall_rating = pmax(1, pmin(5, overall_rating))
)
str(survey_data)
Basic Usage
Data Frame Summary (dfSummary)
The most comprehensive overview of your dataset:
# Basic data frame summary
summary_result <- jsummarytools(
data = clinical_data,
analysis_type = "dfsummary"
)
# The result contains comprehensive information about each variable
print(names(summary_result))
Customized dfSummary
# Customized data frame summary with specific options
custom_summary <- jsummarytools(
data = clinical_data,
analysis_type = "dfsummary",
vars = c("age", "gender", "bmi", "systolic_bp", "diabetes", "treatment_group"),
show_graphs = TRUE,
show_labels = TRUE,
show_variable_numbers = FALSE,
round_digits = 2
)
Grouped dfSummary
# Data frame summary stratified by treatment group
grouped_summary <- jsummarytools(
data = clinical_data,
analysis_type = "dfsummary",
vars = c("age", "bmi", "systolic_bp", "glucose_mg_dl"),
group_var = "treatment_group",
show_graphs = TRUE
)
Frequency Tables (freq)
Detailed frequency distributions for categorical variables:
# Basic frequency analysis
freq_result <- jsummarytools(
data = survey_data,
analysis_type = "freq",
vars = c("satisfaction_service", "usage_frequency", "education")
)
Frequency Tables with Options
# Frequency analysis with cumulative percentages
freq_cumulative <- jsummarytools(
data = survey_data,
analysis_type = "freq",
vars = c("recommend_likelihood", "value_perception"),
include_cumulative = TRUE,
report_missing = TRUE
)
Grouped Frequency Analysis
# Frequency analysis by demographic groups
freq_grouped <- jsummarytools(
data = survey_data,
analysis_type = "freq",
vars = c("satisfaction_service", "will_recommend"),
group_var = "age_group"
)
Descriptive Statistics (descr)
Comprehensive descriptive statistics for numeric variables:
# Basic descriptive statistics
descr_result <- jsummarytools(
data = clinical_data,
analysis_type = "descr",
vars = c("age", "bmi", "systolic_bp", "diastolic_bp", "glucose_mg_dl", "cholesterol_mg_dl")
)
Custom Statistics Selection
# Custom selection of statistics
descr_custom <- jsummarytools(
data = clinical_data,
analysis_type = "descr",
vars = c("age", "bmi", "quality_of_life"),
stats_to_include = "custom",
include_mean = TRUE,
include_median = TRUE,
include_sd = TRUE,
include_quartiles = TRUE,
include_skewness = TRUE,
round_digits = 3
)
Descriptive Statistics by Group
# Descriptive statistics stratified by treatment group
descr_grouped <- jsummarytools(
data = clinical_data,
analysis_type = "descr",
vars = c("systolic_bp", "glucose_mg_dl", "quality_of_life"),
group_var = "treatment_group",
stats_to_include = "all"
)
Different Statistics Sets
# Basic statistics only
descr_basic <- jsummarytools(
data = clinical_data,
analysis_type = "descr",
vars = c("age", "bmi"),
stats_to_include = "basic"
)
# Central tendency measures
descr_central <- jsummarytools(
data = clinical_data,
analysis_type = "descr",
vars = c("systolic_bp", "diastolic_bp"),
stats_to_include = "central"
)
# Dispersion measures
descr_dispersion <- jsummarytools(
data = clinical_data,
analysis_type = "descr",
vars = c("glucose_mg_dl", "cholesterol_mg_dl"),
stats_to_include = "dispersion"
)
Cross-tabulation (ctable)
Professional cross-tabulation tables with statistical tests:
# Basic cross-tabulation
ctable_result <- jsummarytools(
data = clinical_data,
analysis_type = "ctable",
cross_var1 = "treatment_group",
cross_var2 = "treatment_response"
)
Cross-tabulation with Proportions
# Cross-tabulation with row proportions
ctable_row <- jsummarytools(
data = clinical_data,
analysis_type = "ctable",
cross_var1 = "gender",
cross_var2 = "bp_category",
cross_proportions = "row"
)
# Cross-tabulation with column proportions
ctable_col <- jsummarytools(
data = clinical_data,
analysis_type = "ctable",
cross_var1 = "smoking_status",
cross_var2 = "diabetes",
cross_proportions = "col"
)
# Cross-tabulation with total proportions
ctable_total <- jsummarytools(
data = clinical_data,
analysis_type = "ctable",
cross_var1 = "study_site",
cross_var2 = "treatment_group",
cross_proportions = "total"
)
Weighted Cross-tabulation
# Add sample weights to data
clinical_weighted <- clinical_data %>%
mutate(sample_weight = runif(nrow(clinical_data), 0.5, 2.0))
# Weighted cross-tabulation
ctable_weighted <- jsummarytools(
data = clinical_weighted,
analysis_type = "ctable",
cross_var1 = "treatment_group",
cross_var2 = "treatment_response",
weights_var = "sample_weight",
cross_proportions = "row"
)
Advanced Features
Output Styling and Formatting
Different Output Styles
# Grid style (default)
style_grid <- jsummarytools(
data = survey_data,
analysis_type = "freq",
vars = c("satisfaction_service"),
output_style = "grid"
)
# Simple style
style_simple <- jsummarytools(
data = survey_data,
analysis_type = "freq",
vars = c("satisfaction_service"),
output_style = "simple"
)
# Bootstrap style
style_bootstrap <- jsummarytools(
data = survey_data,
analysis_type = "freq",
vars = c("satisfaction_service"),
output_style = "bootstrap"
)
# Minimal style
style_minimal <- jsummarytools(
data = survey_data,
analysis_type = "freq",
vars = c("satisfaction_service"),
output_style = "minimal"
)
HTML vs ASCII Output
# HTML output (default)
html_output <- jsummarytools(
data = clinical_data,
analysis_type = "dfsummary",
vars = c("age", "gender", "bmi"),
plain_ascii = FALSE,
bootstrap_css = TRUE
)
# Plain ASCII output
ascii_output <- jsummarytools(
data = clinical_data,
analysis_type = "dfsummary",
vars = c("age", "gender", "bmi"),
plain_ascii = TRUE
)
Customization Options
# Highly customized output
customized_output <- jsummarytools(
data = clinical_data,
analysis_type = "dfsummary",
vars = c("age", "systolic_bp", "treatment_group"),
show_labels = TRUE,
show_variable_numbers = TRUE,
show_graphs = TRUE,
show_valid_counts = TRUE,
show_na_counts = TRUE,
round_digits = 3,
max_distinct_values = 15,
headings = TRUE,
escape_pipe = TRUE
)
Missing Data Analysis
Dataset with Missing Values
# Create dataset with missing values
clinical_missing <- clinical_data
set.seed(456)
# Introduce different missing patterns
clinical_missing$age[sample(1:300, 25)] <- NA # MCAR
clinical_missing$weight_kg[sample(1:300, 20)] <- NA # MCAR
# MAR: older patients less likely to report income-related variables
older_patients <- which(clinical_missing$age > 70)
clinical_missing$quality_of_life[sample(older_patients, length(older_patients) * 0.3)] <- NA
# MNAR: patients with diabetes less likely to report glucose
diabetic_patients <- which(clinical_missing$diabetes == "Yes")
clinical_missing$glucose_mg_dl[sample(diabetic_patients, length(diabetic_patients) * 0.2)] <- NA
# Analysis of missing data patterns
missing_summary <- jsummarytools(
data = clinical_missing,
analysis_type = "dfsummary",
show_na_counts = TRUE,
show_valid_counts = TRUE
)
Missing Data in Different Analysis Types
# Descriptive statistics with missing data
missing_descr <- jsummarytools(
data = clinical_missing,
analysis_type = "descr",
vars = c("age", "weight_kg", "glucose_mg_dl", "quality_of_life")
)
# Frequency analysis with missing data reporting
missing_freq <- jsummarytools(
data = clinical_missing,
analysis_type = "freq",
vars = c("diabetes", "treatment_group"),
report_missing = TRUE
)
# Cross-tabulation handling missing values
missing_ctable <- jsummarytools(
data = clinical_missing,
analysis_type = "ctable",
cross_var1 = "diabetes",
cross_var2 = "treatment_group"
)
Real-World Applications
Clinical Trial Analysis
# Comprehensive baseline characteristics table
baseline_characteristics <- jsummarytools(
data = clinical_data,
analysis_type = "dfsummary",
vars = c("age", "gender", "race", "bmi", "systolic_bp", "diastolic_bp",
"diabetes", "hypertension", "smoking_status"),
group_var = "treatment_group",
show_graphs = FALSE,
round_digits = 1
)
# Treatment group comparisons
treatment_comparison <- jsummarytools(
data = clinical_data,
analysis_type = "descr",
vars = c("age", "bmi", "systolic_bp", "glucose_mg_dl", "quality_of_life"),
group_var = "treatment_group",
stats_to_include = "all"
)
# Outcome analysis
outcome_analysis <- jsummarytools(
data = clinical_data,
analysis_type = "ctable",
cross_var1 = "treatment_group",
cross_var2 = "treatment_response",
cross_proportions = "row"
)
Survey Research Analysis
# Demographics overview
demographics_overview <- jsummarytools(
data = survey_data,
analysis_type = "freq",
vars = c("age_group", "gender", "education", "income_bracket", "employment")
)
# Satisfaction analysis
satisfaction_analysis <- jsummarytools(
data = survey_data,
analysis_type = "descr",
vars = c("satisfaction_service", "satisfaction_price", "satisfaction_quality",
"recommend_likelihood", "overall_rating"),
stats_to_include = "all"
)
# Customer segmentation
customer_segments <- jsummarytools(
data = survey_data,
analysis_type = "ctable",
cross_var1 = "usage_frequency",
cross_var2 = "return_customer",
cross_proportions = "row"
)
Laboratory Data Analysis
# Laboratory values overview
lab_overview <- jsummarytools(
data = clinical_data,
analysis_type = "descr",
vars = c("glucose_mg_dl", "cholesterol_mg_dl", "creatinine_mg_dl", "hemoglobin_g_dl"),
stats_to_include = "all",
round_digits = 2
)
# Laboratory values by patient characteristics
lab_by_diabetes <- jsummarytools(
data = clinical_data,
analysis_type = "descr",
vars = c("glucose_mg_dl", "cholesterol_mg_dl"),
group_var = "diabetes"
)
lab_by_gender <- jsummarytools(
data = clinical_data,
analysis_type = "descr",
vars = c("hemoglobin_g_dl", "creatinine_mg_dl"),
group_var = "gender"
)
Performance and Optimization
Caching Benefits
# Large dataset for performance testing
large_dataset <- do.call(rbind, replicate(3, clinical_data, simplify = FALSE))
large_dataset$patient_id <- paste0("PT_", sprintf("%04d", 1:nrow(large_dataset)))
# First run (will be slower)
system.time({
perf_result1 <- jsummarytools(
data = large_dataset,
analysis_type = "descr",
vars = c("age", "bmi", "systolic_bp", "glucose_mg_dl")
)
})
# Second run with same parameters (should be faster due to caching)
system.time({
perf_result2 <- jsummarytools(
data = large_dataset,
analysis_type = "descr",
vars = c("age", "bmi", "systolic_bp", "glucose_mg_dl")
)
})
Memory Efficiency
# Check memory usage with different dataset sizes
datasets <- list(
small = clinical_data[1:50, ],
medium = clinical_data[1:150, ],
large = clinical_data
)
memory_usage <- sapply(names(datasets), function(size) {
gc() # Garbage collection
start_memory <- gc()[2,2] # Memory before
result <- jsummarytools(
data = datasets[[size]],
analysis_type = "dfsummary"
)
end_memory <- gc()[2,2] # Memory after
return(end_memory - start_memory)
})
print(memory_usage)
Advanced Statistical Options
Comprehensive Statistics Selection
# All available statistics
all_stats <- jsummarytools(
data = clinical_data,
analysis_type = "descr",
vars = c("age", "bmi", "systolic_bp"),
stats_to_include = "all",
transpose_output = FALSE
)
# Custom statistics with advanced measures
custom_advanced <- jsummarytools(
data = clinical_data,
analysis_type = "descr",
vars = c("glucose_mg_dl", "cholesterol_mg_dl"),
stats_to_include = "custom",
include_mean = TRUE,
include_median = TRUE,
include_mode = FALSE,
include_sd = TRUE,
include_var = TRUE,
include_range = TRUE,
include_quartiles = TRUE,
include_skewness = TRUE,
include_kurtosis = TRUE
)
# Transposed output for publication
transposed_stats <- jsummarytools(
data = clinical_data,
analysis_type = "descr",
vars = c("age", "bmi", "systolic_bp", "glucose_mg_dl"),
transpose_output = TRUE,
round_digits = 2
)
Advanced Cross-tabulation
# Complex cross-tabulation with multiple variables
multi_var_data <- clinical_data %>%
mutate(
age_category = cut(age, breaks = c(0, 50, 65, 100),
labels = c("Under 50", "50-65", "Over 65")),
bmi_category_simple = cut(bmi, breaks = c(0, 25, 30, 50),
labels = c("Normal", "Overweight", "Obese"))
)
# Age and BMI relationship
age_bmi_crosstab <- jsummarytools(
data = multi_var_data,
analysis_type = "ctable",
cross_var1 = "age_category",
cross_var2 = "bmi_category_simple",
cross_proportions = "total"
)
# Treatment and outcome by study site
site_treatment_outcome <- jsummarytools(
data = clinical_data,
analysis_type = "ctable",
cross_var1 = "study_site",
cross_var2 = "treatment_response",
cross_proportions = "col"
)
Error Handling and Edge Cases
Small Sample Sizes
# Very small dataset
tiny_dataset <- clinical_data[1:10, ]
small_summary <- jsummarytools(
data = tiny_dataset,
analysis_type = "dfsummary"
)
small_descr <- jsummarytools(
data = tiny_dataset,
analysis_type = "descr",
vars = c("age", "bmi", "systolic_bp")
)
Single Variable Analysis
# Single numeric variable
single_numeric <- jsummarytools(
data = clinical_data,
analysis_type = "descr",
vars = c("age")
)
# Single categorical variable
single_categorical <- jsummarytools(
data = clinical_data,
analysis_type = "freq",
vars = c("gender")
)
Unusual Data Distributions
# Create data with unusual characteristics
unusual_data <- data.frame(
id = 1:100,
constant_var = rep(5, 100), # No variation
binary_rare = c(rep(0, 95), rep(1, 5)), # Very rare event
skewed_var = exp(rnorm(100, 0, 2)), # Highly skewed
outlier_var = c(rnorm(95, 0, 1), rnorm(5, 100, 1)), # With outliers
normal_var = rnorm(100, 50, 10)
)
unusual_summary <- jsummarytools(
data = unusual_data,
analysis_type = "dfsummary"
)
unusual_descr <- jsummarytools(
data = unusual_data,
analysis_type = "descr",
vars = c("skewed_var", "outlier_var", "normal_var")
)
Best Practices and Guidelines
Analysis Type Selection Guide
# Create guidance table
analysis_guide <- data.frame(
Analysis_Type = c("dfsummary", "freq", "descr", "ctable"),
Best_For = c(
"Initial data exploration and quality assessment",
"Categorical variable distributions and patterns",
"Comprehensive numeric variable summaries",
"Relationships between categorical variables"
),
When_To_Use = c(
"First look at new dataset, data cleaning",
"Survey responses, demographic analysis",
"Baseline characteristics, outcome measures",
"Treatment comparisons, association testing"
),
Output_Type = c(
"HTML tables with graphs and summaries",
"Frequency tables with percentages",
"Statistical summary tables",
"Cross-tabulation with chi-square tests"
)
)
kable(analysis_guide, caption = "jsummarytools Analysis Type Selection Guide")
Statistical Interpretation Guidelines
# Create interpretation guide
interpretation_guide <- data.frame(
Statistic = c("Mean", "Median", "Standard Deviation", "IQR", "Skewness", "Kurtosis"),
Interpretation = c(
"Average value, sensitive to outliers",
"Middle value, robust to outliers",
"Average distance from mean",
"Range of middle 50% of data",
"Asymmetry of distribution",
"Peakedness of distribution"
),
Normal_Range = c(
"Depends on variable",
"Should be close to mean if normal",
"About 1/6 of range for normal data",
"About 1.35 * SD for normal data",
"0 for normal distribution",
"3 for normal distribution"
)
)
kable(interpretation_guide, caption = "Statistical Interpretation Guide")
Publication-Ready Tables
# Table 1: Baseline characteristics
table1 <- jsummarytools(
data = clinical_data,
analysis_type = "descr",
vars = c("age", "bmi", "systolic_bp", "diastolic_bp", "glucose_mg_dl"),
group_var = "treatment_group",
stats_to_include = "basic",
round_digits = 1
)
# Table 2: Categorical variables summary
table2 <- jsummarytools(
data = clinical_data,
analysis_type = "freq",
vars = c("gender", "race", "diabetes", "hypertension", "smoking_status")
)
# Table 3: Treatment outcomes
table3 <- jsummarytools(
data = clinical_data,
analysis_type = "ctable",
cross_var1 = "treatment_group",
cross_var2 = "treatment_response",
cross_proportions = "row"
)
Integration with Research Workflows
Reproducible Research
# Set parameters for reproducible analysis
analysis_params <- list(
variables_continuous = c("age", "bmi", "systolic_bp", "glucose_mg_dl"),
variables_categorical = c("gender", "diabetes", "treatment_group"),
grouping_variable = "treatment_group",
decimal_places = 2,
include_missing = TRUE
)
# Systematic analysis using parameters
systematic_analysis <- function(data, params) {
results <- list()
# Data overview
results$overview <- jsummarytools(
data = data,
analysis_type = "dfsummary",
show_na_counts = params$include_missing
)
# Continuous variables
results$continuous <- jsummarytools(
data = data,
analysis_type = "descr",
vars = params$variables_continuous,
group_var = params$grouping_variable,
round_digits = params$decimal_places
)
# Categorical variables
results$categorical <- jsummarytools(
data = data,
analysis_type = "freq",
vars = params$variables_categorical,
report_missing = params$include_missing
)
return(results)
}
# Run systematic analysis
study_results <- systematic_analysis(clinical_data, analysis_params)
Quality Control and Validation
# Data quality assessment
quality_check <- jsummarytools(
data = clinical_data,
analysis_type = "dfsummary",
show_na_counts = TRUE,
show_valid_counts = TRUE,
max_distinct_values = 20
)
# Outlier detection through descriptive statistics
outlier_detection <- jsummarytools(
data = clinical_data,
analysis_type = "descr",
vars = c("age", "bmi", "systolic_bp", "glucose_mg_dl", "cholesterol_mg_dl"),
stats_to_include = "all"
)
# Completeness assessment
completeness_check <- clinical_data %>%
summarise(
total_observations = n(),
complete_cases = sum(complete.cases(.)),
completeness_rate = round(100 * complete_cases / total_observations, 1)
)
print(completeness_check)
Summary
The jsummarytools
function provides a comprehensive
solution for descriptive statistics in clinical and survey research. Key
advantages include:
Core Capabilities
- Four Analysis Types: dfSummary for data exploration, freq for categorical analysis, descr for numeric summaries, ctable for cross-tabulations
- Professional Output: HTML-formatted tables ready for reports and publications
- Grouping Support: Stratified analyses for treatment comparisons and subgroup analyses
- Missing Data Handling: Comprehensive reporting and appropriate handling of missing values
Performance Features
- Intelligent Caching: Automatic caching of results for faster repeated analyses
- Memory Efficiency: Optimized for large datasets with smart data preparation
- Error Handling: Robust error handling for edge cases and unusual data
- Validation: Comprehensive input validation and informative error messages
Customization Options
- Output Formatting: Multiple styles including HTML, ASCII, and Bootstrap CSS
- Statistical Control: Custom selection of statistics and precision control
- Visual Elements: Optional graphs and charts in dfSummary output
- Publication Ready: Professional formatting suitable for academic publications
Use Cases
- Clinical Research: Baseline characteristics, outcome analysis, safety reporting
- Survey Research: Demographics, satisfaction analysis, response patterns
- Quality Control: Data validation, completeness assessment, outlier detection
- Regulatory Submissions: Standardized statistical summaries for regulatory bodies