Comprehensive Descriptive Statistics with jsummarytools

Introduction

The jsummarytools function provides comprehensive descriptive statistics using the powerful summarytools package. This function is designed specifically for clinical researchers, epidemiologists, and data analysts who need professional-quality descriptive statistics with publication-ready output.

Key Features

Four Analysis Types: Data frame summaries (dfSummary), frequency tables (freq), descriptive statistics (descr), and cross-tabulations (ctable)
Professional Output: HTML-formatted tables and summaries ready for reports and publications
Comprehensive Statistics: Complete descriptive statistics with missing data reporting
Grouping Support: Stratified analyses by grouping variables
Performance Optimized: Intelligent caching system for faster repeated analyses
Flexible Customization: Extensive options for formatting and statistical content

Installation and Setup

# Load required libraries
library(ClinicoPath)

## Warning: replacing previous import 'dplyr::as_data_frame' by
## 'igraph::as_data_frame' when loading 'ClinicoPath'

## Warning: replacing previous import 'DiagrammeR::count_automorphisms' by
## 'igraph::count_automorphisms' when loading 'ClinicoPath'

## Warning: replacing previous import 'dplyr::groups' by 'igraph::groups' when
## loading 'ClinicoPath'

## Warning: replacing previous import 'DiagrammeR::get_edge_ids' by
## 'igraph::get_edge_ids' when loading 'ClinicoPath'

## Warning: replacing previous import 'dplyr::union' by 'igraph::union' when
## loading 'ClinicoPath'

## Warning: replacing previous import 'dplyr::select' by 'jmvcore::select' when
## loading 'ClinicoPath'

## Warning: replacing previous import 'igraph::union' by 'lubridate::union' when
## loading 'ClinicoPath'

## Warning: replacing previous import 'igraph::%--%' by 'lubridate::%--%' when
## loading 'ClinicoPath'

## Warning: replacing previous import 'cutpointr::tnr' by 'mlr3measures::tnr' when
## loading 'ClinicoPath'

## Warning: replacing previous import 'cutpointr::precision' by
## 'mlr3measures::precision' when loading 'ClinicoPath'

## Warning: replacing previous import 'cutpointr::tn' by 'mlr3measures::tn' when
## loading 'ClinicoPath'

## Warning: replacing previous import 'cutpointr::fnr' by 'mlr3measures::fnr' when
## loading 'ClinicoPath'

## Warning: replacing previous import 'cutpointr::tp' by 'mlr3measures::tp' when
## loading 'ClinicoPath'

## Warning: replacing previous import 'cutpointr::npv' by 'mlr3measures::npv' when
## loading 'ClinicoPath'

## Warning: replacing previous import 'cutpointr::ppv' by 'mlr3measures::ppv' when
## loading 'ClinicoPath'

## Warning: replacing previous import 'cutpointr::auc' by 'mlr3measures::auc' when
## loading 'ClinicoPath'

## Warning: replacing previous import 'cutpointr::tpr' by 'mlr3measures::tpr' when
## loading 'ClinicoPath'

## Warning: replacing previous import 'cutpointr::fn' by 'mlr3measures::fn' when
## loading 'ClinicoPath'

## Warning: replacing previous import 'cutpointr::fp' by 'mlr3measures::fp' when
## loading 'ClinicoPath'

## Warning: replacing previous import 'cutpointr::fpr' by 'mlr3measures::fpr' when
## loading 'ClinicoPath'

## Warning: replacing previous import 'cutpointr::recall' by
## 'mlr3measures::recall' when loading 'ClinicoPath'

## Warning: replacing previous import 'cutpointr::specificity' by
## 'mlr3measures::specificity' when loading 'ClinicoPath'

## Warning: replacing previous import 'cutpointr::sensitivity' by
## 'mlr3measures::sensitivity' when loading 'ClinicoPath'

## Warning: replacing previous import 'igraph::as_data_frame' by
## 'tibble::as_data_frame' when loading 'ClinicoPath'

## Warning: replacing previous import 'igraph::crossing' by 'tidyr::crossing' when
## loading 'ClinicoPath'

## Warning: replacing previous import 'magrittr::extract' by 'tidyr::extract' when
## loading 'ClinicoPath'

## Warning: replacing previous import 'mlr3measures::sensitivity' by
## 'caret::sensitivity' when loading 'ClinicoPath'

## Warning: replacing previous import 'mlr3measures::specificity' by
## 'caret::specificity' when loading 'ClinicoPath'

## Registered S3 methods overwritten by 'useful':
##   method         from     
##   autoplot.acf   ggfortify
##   fortify.acf    ggfortify
##   fortify.kmeans ggfortify
##   fortify.ts     ggfortify

## Warning: replacing previous import 'jmvcore::select' by 'dplyr::select' when
## loading 'ClinicoPath'

## Registered S3 methods overwritten by 'ggpp':
##   method                  from   
##   heightDetails.titleGrob ggplot2
##   widthDetails.titleGrob  ggplot2

## Warning: replacing previous import 'DataExplorer::plot_histogram' by
## 'grafify::plot_histogram' when loading 'ClinicoPath'

## Warning: replacing previous import 'dplyr::select' by 'jmvcore::select' when
## loading 'ClinicoPath'

## Warning: replacing previous import 'mlr3measures::auc' by 'pROC::auc' when
## loading 'ClinicoPath'

## Warning: replacing previous import 'cutpointr::roc' by 'pROC::roc' when loading
## 'ClinicoPath'

## Warning: replacing previous import 'tibble::view' by 'summarytools::view' when
## loading 'ClinicoPath'

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(knitr)

# Set options for better output
options(digits = 3)
knitr::opts_chunk$set(
  fig.width = 12,
  fig.height = 8,
    dpi = 300,
  echo = TRUE,
  eval = FALSE,
  out.width = "100%"
)

# Check if summarytools is available
if (!requireNamespace("summarytools", quietly = TRUE)) {
  message("Note: summarytools package not available. Install with: install.packages('summarytools')")
}

Data Preparation

Let’s create realistic datasets for different types of descriptive analyses:

Clinical Research Data

# Create comprehensive clinical dataset
set.seed(123)
clinical_data <- data.frame(
  # Patient identifiers
  patient_id = paste0("PT_", sprintf("%03d", 1:300)),
  
  # Demographics
  age = round(rnorm(300, 65, 12)),
  gender = factor(sample(c("Male", "Female"), 300, replace = TRUE)),
  race = factor(sample(c("White", "Black", "Hispanic", "Asian", "Other"), 
                      300, replace = TRUE, prob = c(0.65, 0.15, 0.12, 0.06, 0.02))),
  
  # Clinical measurements
  height_cm = round(rnorm(300, 170, 12)),
  weight_kg = round(rnorm(300, 75, 15)),
  systolic_bp = round(rnorm(300, 140, 25)),
  diastolic_bp = round(rnorm(300, 85, 15)),
  heart_rate = round(rnorm(300, 75, 12)),
  
  # Laboratory values
  glucose_mg_dl = round(rnorm(300, 110, 30)),
  cholesterol_mg_dl = round(rnorm(300, 200, 40)),
  creatinine_mg_dl = round(rnorm(300, 1.0, 0.3), 2),
  hemoglobin_g_dl = round(rnorm(300, 13.5, 2.0), 1),
  
  # Medical history
  diabetes = factor(sample(c("No", "Yes"), 300, replace = TRUE, prob = c(0.7, 0.3))),
  hypertension = factor(sample(c("No", "Yes"), 300, replace = TRUE, prob = c(0.6, 0.4))),
  smoking_status = factor(sample(c("Never", "Former", "Current"), 
                                300, replace = TRUE, prob = c(0.5, 0.35, 0.15))),
  
  # Treatment information
  treatment_group = factor(sample(c("Control", "Treatment_A", "Treatment_B"), 
                                 300, replace = TRUE)),
  study_site = factor(sample(c("Site_1", "Site_2", "Site_3"), 300, replace = TRUE)),
  
  # Outcomes
  treatment_response = factor(sample(c("No Response", "Partial", "Complete"), 
                                    300, replace = TRUE, prob = c(0.3, 0.5, 0.2))),
  quality_of_life = round(rnorm(300, 70, 18)),
  adverse_events = rpois(300, 1.2)
) %>%
  mutate(
    # Calculate BMI
    bmi = round(weight_kg / (height_cm/100)^2, 1),
    
    # Create realistic medical relationships
    systolic_bp = pmax(90, systolic_bp + 0.3 * age + 
                      ifelse(diabetes == "Yes", 15, 0) +
                      ifelse(smoking_status == "Current", 10, 0)),
    
    glucose_mg_dl = pmax(70, glucose_mg_dl + 
                        ifelse(diabetes == "Yes", 50, 0) + 
                        0.1 * age),
    
    cholesterol_mg_dl = pmax(120, cholesterol_mg_dl + 0.5 * age + 0.8 * bmi),
    
    # Blood pressure category
    bp_category = factor(case_when(
      systolic_bp < 120 ~ "Normal",
      systolic_bp < 130 ~ "Elevated", 
      systolic_bp < 140 ~ "Stage 1",
      TRUE ~ "Stage 2"
    ), levels = c("Normal", "Elevated", "Stage 1", "Stage 2")),
    
    # BMI category
    bmi_category = factor(case_when(
      bmi < 18.5 ~ "Underweight",
      bmi < 25 ~ "Normal",
      bmi < 30 ~ "Overweight", 
      TRUE ~ "Obese"
    ), levels = c("Underweight", "Normal", "Overweight", "Obese"))
  )

# Display data structure
str(clinical_data)

Survey Research Data

# Create survey/questionnaire dataset
survey_data <- data.frame(
  respondent_id = 1:500,
  
  # Demographics
  age_group = factor(sample(c("18-25", "26-35", "36-50", "51-65", "65+"), 
                           500, replace = TRUE, prob = c(0.15, 0.25, 0.3, 0.2, 0.1))),
  gender = factor(sample(c("Male", "Female", "Non-binary"), 
                        500, replace = TRUE, prob = c(0.48, 0.5, 0.02))),
  education = factor(sample(c("High School", "Some College", "Bachelor", "Graduate"), 
                           500, replace = TRUE, prob = c(0.25, 0.25, 0.35, 0.15))),
  income_bracket = factor(sample(c("Under $30k", "$30k-$60k", "$60k-$100k", "Over $100k"), 
                                500, replace = TRUE, prob = c(0.25, 0.35, 0.25, 0.15))),
  employment = factor(sample(c("Full-time", "Part-time", "Unemployed", "Student", "Retired"), 
                            500, replace = TRUE, prob = c(0.55, 0.15, 0.1, 0.1, 0.1))),
  
  # Likert scale responses (1-5: Strongly Disagree to Strongly Agree)
  satisfaction_service = sample(1:5, 500, replace = TRUE, prob = c(0.05, 0.1, 0.2, 0.45, 0.2)),
  satisfaction_price = sample(1:5, 500, replace = TRUE, prob = c(0.1, 0.15, 0.25, 0.35, 0.15)),
  satisfaction_quality = sample(1:5, 500, replace = TRUE, prob = c(0.03, 0.07, 0.15, 0.5, 0.25)),
  
  # 7-point agreement scales
  recommend_likelihood = sample(1:7, 500, replace = TRUE, 
                               prob = c(0.05, 0.05, 0.1, 0.15, 0.2, 0.25, 0.2)),
  value_perception = sample(1:7, 500, replace = TRUE, 
                           prob = c(0.08, 0.07, 0.12, 0.18, 0.25, 0.2, 0.1)),
  
  # Frequency responses
  usage_frequency = factor(sample(c("Never", "Rarely", "Sometimes", "Often", "Always"), 
                                 500, replace = TRUE, prob = c(0.05, 0.15, 0.3, 0.35, 0.15))),
  
  # Binary choices
  will_recommend = factor(sample(c("No", "Yes"), 500, replace = TRUE, prob = c(0.25, 0.75))),
  return_customer = factor(sample(c("No", "Yes"), 500, replace = TRUE, prob = c(0.3, 0.7))),
  
  # Continuous measures
  years_customer = round(runif(500, 0, 10), 1),
  monthly_spending = round(exp(rnorm(500, 4.5, 0.8)), 2),
  overall_rating = round(rnorm(500, 4.2, 0.8), 1)
) %>%
  mutate(
    # Constrain ratings to valid ranges
    overall_rating = pmax(1, pmin(5, overall_rating))
  )

str(survey_data)

Basic Usage

Data Frame Summary (dfSummary)

The most comprehensive overview of your dataset:

# Basic data frame summary
summary_result <- jsummarytools(
  data = clinical_data,
  analysis_type = "dfsummary"
)

# The result contains comprehensive information about each variable
print(names(summary_result))

Customized dfSummary

# Customized data frame summary with specific options
custom_summary <- jsummarytools(
  data = clinical_data,
  analysis_type = "dfsummary",
  vars = c("age", "gender", "bmi", "systolic_bp", "diabetes", "treatment_group"),
  show_graphs = TRUE,
  show_labels = TRUE,
  show_variable_numbers = FALSE,
  round_digits = 2
)

Grouped dfSummary

# Data frame summary stratified by treatment group
grouped_summary <- jsummarytools(
  data = clinical_data,
  analysis_type = "dfsummary", 
  vars = c("age", "bmi", "systolic_bp", "glucose_mg_dl"),
  group_var = "treatment_group",
  show_graphs = TRUE
)

Frequency Tables (freq)

Detailed frequency distributions for categorical variables:

# Basic frequency analysis
freq_result <- jsummarytools(
  data = survey_data,
  analysis_type = "freq",
  vars = c("satisfaction_service", "usage_frequency", "education")
)

Frequency Tables with Options

# Frequency analysis with cumulative percentages
freq_cumulative <- jsummarytools(
  data = survey_data,
  analysis_type = "freq",
  vars = c("recommend_likelihood", "value_perception"),
  include_cumulative = TRUE,
  report_missing = TRUE
)

Grouped Frequency Analysis

# Frequency analysis by demographic groups
freq_grouped <- jsummarytools(
  data = survey_data,
  analysis_type = "freq",
  vars = c("satisfaction_service", "will_recommend"),
  group_var = "age_group"
)

Descriptive Statistics (descr)

Comprehensive descriptive statistics for numeric variables:

# Basic descriptive statistics
descr_result <- jsummarytools(
  data = clinical_data,
  analysis_type = "descr",
  vars = c("age", "bmi", "systolic_bp", "diastolic_bp", "glucose_mg_dl", "cholesterol_mg_dl")
)

Custom Statistics Selection

# Custom selection of statistics
descr_custom <- jsummarytools(
  data = clinical_data,
  analysis_type = "descr",
  vars = c("age", "bmi", "quality_of_life"),
  stats_to_include = "custom",
  include_mean = TRUE,
  include_median = TRUE,
  include_sd = TRUE,
  include_quartiles = TRUE,
  include_skewness = TRUE,
  round_digits = 3
)

Descriptive Statistics by Group

# Descriptive statistics stratified by treatment group
descr_grouped <- jsummarytools(
  data = clinical_data,
  analysis_type = "descr",
  vars = c("systolic_bp", "glucose_mg_dl", "quality_of_life"),
  group_var = "treatment_group",
  stats_to_include = "all"
)

Different Statistics Sets

# Basic statistics only
descr_basic <- jsummarytools(
  data = clinical_data,
  analysis_type = "descr",
  vars = c("age", "bmi"),
  stats_to_include = "basic"
)

# Central tendency measures
descr_central <- jsummarytools(
  data = clinical_data,
  analysis_type = "descr", 
  vars = c("systolic_bp", "diastolic_bp"),
  stats_to_include = "central"
)

# Dispersion measures
descr_dispersion <- jsummarytools(
  data = clinical_data,
  analysis_type = "descr",
  vars = c("glucose_mg_dl", "cholesterol_mg_dl"),
  stats_to_include = "dispersion"
)

Cross-tabulation (ctable)

Professional cross-tabulation tables with statistical tests:

# Basic cross-tabulation
ctable_result <- jsummarytools(
  data = clinical_data,
  analysis_type = "ctable",
  cross_var1 = "treatment_group",
  cross_var2 = "treatment_response"
)

Cross-tabulation with Proportions

# Cross-tabulation with row proportions
ctable_row <- jsummarytools(
  data = clinical_data,
  analysis_type = "ctable",
  cross_var1 = "gender",
  cross_var2 = "bp_category",
  cross_proportions = "row"
)

# Cross-tabulation with column proportions  
ctable_col <- jsummarytools(
  data = clinical_data,
  analysis_type = "ctable",
  cross_var1 = "smoking_status",
  cross_var2 = "diabetes",
  cross_proportions = "col"
)

# Cross-tabulation with total proportions
ctable_total <- jsummarytools(
  data = clinical_data,
  analysis_type = "ctable",
  cross_var1 = "study_site", 
  cross_var2 = "treatment_group",
  cross_proportions = "total"
)

Weighted Cross-tabulation

# Add sample weights to data
clinical_weighted <- clinical_data %>%
  mutate(sample_weight = runif(nrow(clinical_data), 0.5, 2.0))

# Weighted cross-tabulation
ctable_weighted <- jsummarytools(
  data = clinical_weighted,
  analysis_type = "ctable",
  cross_var1 = "treatment_group",
  cross_var2 = "treatment_response", 
  weights_var = "sample_weight",
  cross_proportions = "row"
)

Advanced Features

Output Styling and Formatting

Different Output Styles

# Grid style (default)
style_grid <- jsummarytools(
  data = survey_data,
  analysis_type = "freq",
  vars = c("satisfaction_service"),
  output_style = "grid"
)

# Simple style
style_simple <- jsummarytools(
  data = survey_data,
  analysis_type = "freq", 
  vars = c("satisfaction_service"),
  output_style = "simple"
)

# Bootstrap style
style_bootstrap <- jsummarytools(
  data = survey_data,
  analysis_type = "freq",
  vars = c("satisfaction_service"), 
  output_style = "bootstrap"
)

# Minimal style
style_minimal <- jsummarytools(
  data = survey_data,
  analysis_type = "freq",
  vars = c("satisfaction_service"),
  output_style = "minimal"
)

HTML vs ASCII Output

# HTML output (default)
html_output <- jsummarytools(
  data = clinical_data,
  analysis_type = "dfsummary",
  vars = c("age", "gender", "bmi"),
  plain_ascii = FALSE,
  bootstrap_css = TRUE
)

# Plain ASCII output
ascii_output <- jsummarytools(
  data = clinical_data,
  analysis_type = "dfsummary",
  vars = c("age", "gender", "bmi"),
  plain_ascii = TRUE
)

Customization Options

# Highly customized output
customized_output <- jsummarytools(
  data = clinical_data,
  analysis_type = "dfsummary",
  vars = c("age", "systolic_bp", "treatment_group"),
  show_labels = TRUE,
  show_variable_numbers = TRUE,
  show_graphs = TRUE,
  show_valid_counts = TRUE,
  show_na_counts = TRUE,
  round_digits = 3,
  max_distinct_values = 15,
  headings = TRUE,
  escape_pipe = TRUE
)

Missing Data Analysis

Dataset with Missing Values

# Create dataset with missing values
clinical_missing <- clinical_data
set.seed(456)

# Introduce different missing patterns
clinical_missing$age[sample(1:300, 25)] <- NA  # MCAR
clinical_missing$weight_kg[sample(1:300, 20)] <- NA  # MCAR

# MAR: older patients less likely to report income-related variables
older_patients <- which(clinical_missing$age > 70)
clinical_missing$quality_of_life[sample(older_patients, length(older_patients) * 0.3)] <- NA

# MNAR: patients with diabetes less likely to report glucose
diabetic_patients <- which(clinical_missing$diabetes == "Yes")
clinical_missing$glucose_mg_dl[sample(diabetic_patients, length(diabetic_patients) * 0.2)] <- NA

# Analysis of missing data patterns
missing_summary <- jsummarytools(
  data = clinical_missing,
  analysis_type = "dfsummary",
  show_na_counts = TRUE,
  show_valid_counts = TRUE
)

Missing Data in Different Analysis Types

# Descriptive statistics with missing data
missing_descr <- jsummarytools(
  data = clinical_missing,
  analysis_type = "descr",
  vars = c("age", "weight_kg", "glucose_mg_dl", "quality_of_life")
)

# Frequency analysis with missing data reporting
missing_freq <- jsummarytools(
  data = clinical_missing,
  analysis_type = "freq",
  vars = c("diabetes", "treatment_group"),
  report_missing = TRUE
)

# Cross-tabulation handling missing values
missing_ctable <- jsummarytools(
  data = clinical_missing,
  analysis_type = "ctable",
  cross_var1 = "diabetes",
  cross_var2 = "treatment_group"
)

Real-World Applications

Clinical Trial Analysis

# Comprehensive baseline characteristics table
baseline_characteristics <- jsummarytools(
  data = clinical_data,
  analysis_type = "dfsummary",
  vars = c("age", "gender", "race", "bmi", "systolic_bp", "diastolic_bp", 
           "diabetes", "hypertension", "smoking_status"),
  group_var = "treatment_group",
  show_graphs = FALSE,
  round_digits = 1
)

# Treatment group comparisons
treatment_comparison <- jsummarytools(
  data = clinical_data,
  analysis_type = "descr",
  vars = c("age", "bmi", "systolic_bp", "glucose_mg_dl", "quality_of_life"),
  group_var = "treatment_group",
  stats_to_include = "all"
)

# Outcome analysis
outcome_analysis <- jsummarytools(
  data = clinical_data,
  analysis_type = "ctable",
  cross_var1 = "treatment_group",
  cross_var2 = "treatment_response",
  cross_proportions = "row"
)

Survey Research Analysis

# Demographics overview
demographics_overview <- jsummarytools(
  data = survey_data,
  analysis_type = "freq",
  vars = c("age_group", "gender", "education", "income_bracket", "employment")
)

# Satisfaction analysis
satisfaction_analysis <- jsummarytools(
  data = survey_data,
  analysis_type = "descr",
  vars = c("satisfaction_service", "satisfaction_price", "satisfaction_quality", 
           "recommend_likelihood", "overall_rating"),
  stats_to_include = "all"
)

# Customer segmentation
customer_segments <- jsummarytools(
  data = survey_data,
  analysis_type = "ctable",
  cross_var1 = "usage_frequency",
  cross_var2 = "return_customer",
  cross_proportions = "row"
)

Laboratory Data Analysis

# Laboratory values overview
lab_overview <- jsummarytools(
  data = clinical_data,
  analysis_type = "descr",
  vars = c("glucose_mg_dl", "cholesterol_mg_dl", "creatinine_mg_dl", "hemoglobin_g_dl"),
  stats_to_include = "all",
  round_digits = 2
)

# Laboratory values by patient characteristics
lab_by_diabetes <- jsummarytools(
  data = clinical_data,
  analysis_type = "descr",
  vars = c("glucose_mg_dl", "cholesterol_mg_dl"),
  group_var = "diabetes"
)

lab_by_gender <- jsummarytools(
  data = clinical_data,
  analysis_type = "descr",
  vars = c("hemoglobin_g_dl", "creatinine_mg_dl"),
  group_var = "gender"
)

Performance and Optimization

Caching Benefits

# Large dataset for performance testing
large_dataset <- do.call(rbind, replicate(3, clinical_data, simplify = FALSE))
large_dataset$patient_id <- paste0("PT_", sprintf("%04d", 1:nrow(large_dataset)))

# First run (will be slower)
system.time({
  perf_result1 <- jsummarytools(
    data = large_dataset,
    analysis_type = "descr",
    vars = c("age", "bmi", "systolic_bp", "glucose_mg_dl")
  )
})

# Second run with same parameters (should be faster due to caching)
system.time({
  perf_result2 <- jsummarytools(
    data = large_dataset,
    analysis_type = "descr", 
    vars = c("age", "bmi", "systolic_bp", "glucose_mg_dl")
  )
})

Memory Efficiency

# Check memory usage with different dataset sizes
datasets <- list(
  small = clinical_data[1:50, ],
  medium = clinical_data[1:150, ],
  large = clinical_data
)

memory_usage <- sapply(names(datasets), function(size) {
  gc()  # Garbage collection
  start_memory <- gc()[2,2]  # Memory before
  
  result <- jsummarytools(
    data = datasets[[size]],
    analysis_type = "dfsummary"
  )
  
  end_memory <- gc()[2,2]  # Memory after
  return(end_memory - start_memory)
})

print(memory_usage)

Advanced Statistical Options

Comprehensive Statistics Selection

# All available statistics
all_stats <- jsummarytools(
  data = clinical_data,
  analysis_type = "descr",
  vars = c("age", "bmi", "systolic_bp"),
  stats_to_include = "all",
  transpose_output = FALSE
)

# Custom statistics with advanced measures
custom_advanced <- jsummarytools(
  data = clinical_data,
  analysis_type = "descr",
  vars = c("glucose_mg_dl", "cholesterol_mg_dl"),
  stats_to_include = "custom",
  include_mean = TRUE,
  include_median = TRUE,
  include_mode = FALSE,
  include_sd = TRUE,
  include_var = TRUE,
  include_range = TRUE,
  include_quartiles = TRUE,
  include_skewness = TRUE,
  include_kurtosis = TRUE
)

# Transposed output for publication
transposed_stats <- jsummarytools(
  data = clinical_data,
  analysis_type = "descr",
  vars = c("age", "bmi", "systolic_bp", "glucose_mg_dl"),
  transpose_output = TRUE,
  round_digits = 2
)

Advanced Cross-tabulation

# Complex cross-tabulation with multiple variables
multi_var_data <- clinical_data %>%
  mutate(
    age_category = cut(age, breaks = c(0, 50, 65, 100), 
                      labels = c("Under 50", "50-65", "Over 65")),
    bmi_category_simple = cut(bmi, breaks = c(0, 25, 30, 50),
                             labels = c("Normal", "Overweight", "Obese"))
  )

# Age and BMI relationship
age_bmi_crosstab <- jsummarytools(
  data = multi_var_data,
  analysis_type = "ctable",
  cross_var1 = "age_category",
  cross_var2 = "bmi_category_simple",
  cross_proportions = "total"
)

# Treatment and outcome by study site
site_treatment_outcome <- jsummarytools(
  data = clinical_data,
  analysis_type = "ctable",
  cross_var1 = "study_site",
  cross_var2 = "treatment_response",
  cross_proportions = "col"
)

Error Handling and Edge Cases

Small Sample Sizes

# Very small dataset
tiny_dataset <- clinical_data[1:10, ]

small_summary <- jsummarytools(
  data = tiny_dataset,
  analysis_type = "dfsummary"
)

small_descr <- jsummarytools(
  data = tiny_dataset,
  analysis_type = "descr",
  vars = c("age", "bmi", "systolic_bp")
)

Single Variable Analysis

# Single numeric variable
single_numeric <- jsummarytools(
  data = clinical_data,
  analysis_type = "descr",
  vars = c("age")
)

# Single categorical variable
single_categorical <- jsummarytools(
  data = clinical_data,
  analysis_type = "freq",
  vars = c("gender")
)

Unusual Data Distributions

# Create data with unusual characteristics
unusual_data <- data.frame(
  id = 1:100,
  constant_var = rep(5, 100),  # No variation
  binary_rare = c(rep(0, 95), rep(1, 5)),  # Very rare event
  skewed_var = exp(rnorm(100, 0, 2)),  # Highly skewed
  outlier_var = c(rnorm(95, 0, 1), rnorm(5, 100, 1)),  # With outliers
  normal_var = rnorm(100, 50, 10)
)

unusual_summary <- jsummarytools(
  data = unusual_data,
  analysis_type = "dfsummary"
)

unusual_descr <- jsummarytools(
  data = unusual_data,
  analysis_type = "descr",
  vars = c("skewed_var", "outlier_var", "normal_var")
)

Best Practices and Guidelines

Analysis Type Selection Guide

# Create guidance table
analysis_guide <- data.frame(
  Analysis_Type = c("dfsummary", "freq", "descr", "ctable"),
  Best_For = c(
    "Initial data exploration and quality assessment",
    "Categorical variable distributions and patterns",
    "Comprehensive numeric variable summaries",
    "Relationships between categorical variables"
  ),
  When_To_Use = c(
    "First look at new dataset, data cleaning",
    "Survey responses, demographic analysis",
    "Baseline characteristics, outcome measures",
    "Treatment comparisons, association testing"
  ),
  Output_Type = c(
    "HTML tables with graphs and summaries",
    "Frequency tables with percentages",
    "Statistical summary tables",
    "Cross-tabulation with chi-square tests"
  )
)

kable(analysis_guide, caption = "jsummarytools Analysis Type Selection Guide")

Statistical Interpretation Guidelines

# Create interpretation guide
interpretation_guide <- data.frame(
  Statistic = c("Mean", "Median", "Standard Deviation", "IQR", "Skewness", "Kurtosis"),
  Interpretation = c(
    "Average value, sensitive to outliers",
    "Middle value, robust to outliers", 
    "Average distance from mean",
    "Range of middle 50% of data",
    "Asymmetry of distribution",
    "Peakedness of distribution"
  ),
  Normal_Range = c(
    "Depends on variable",
    "Should be close to mean if normal",
    "About 1/6 of range for normal data",
    "About 1.35 * SD for normal data", 
    "0 for normal distribution",
    "3 for normal distribution"
  )
)

kable(interpretation_guide, caption = "Statistical Interpretation Guide")

Publication-Ready Tables

# Table 1: Baseline characteristics
table1 <- jsummarytools(
  data = clinical_data,
  analysis_type = "descr",
  vars = c("age", "bmi", "systolic_bp", "diastolic_bp", "glucose_mg_dl"),
  group_var = "treatment_group",
  stats_to_include = "basic",
  round_digits = 1
)

# Table 2: Categorical variables summary
table2 <- jsummarytools(
  data = clinical_data,
  analysis_type = "freq", 
  vars = c("gender", "race", "diabetes", "hypertension", "smoking_status")
)

# Table 3: Treatment outcomes
table3 <- jsummarytools(
  data = clinical_data,
  analysis_type = "ctable",
  cross_var1 = "treatment_group",
  cross_var2 = "treatment_response",
  cross_proportions = "row"
)

Integration with Research Workflows

Reproducible Research

# Set parameters for reproducible analysis
analysis_params <- list(
  variables_continuous = c("age", "bmi", "systolic_bp", "glucose_mg_dl"),
  variables_categorical = c("gender", "diabetes", "treatment_group"),
  grouping_variable = "treatment_group",
  decimal_places = 2,
  include_missing = TRUE
)

# Systematic analysis using parameters
systematic_analysis <- function(data, params) {
  results <- list()
  
  # Data overview
  results$overview <- jsummarytools(
    data = data,
    analysis_type = "dfsummary",
    show_na_counts = params$include_missing
  )
  
  # Continuous variables
  results$continuous <- jsummarytools(
    data = data,
    analysis_type = "descr",
    vars = params$variables_continuous,
    group_var = params$grouping_variable,
    round_digits = params$decimal_places
  )
  
  # Categorical variables
  results$categorical <- jsummarytools(
    data = data,
    analysis_type = "freq",
    vars = params$variables_categorical,
    report_missing = params$include_missing
  )
  
  return(results)
}

# Run systematic analysis
study_results <- systematic_analysis(clinical_data, analysis_params)

Quality Control and Validation

# Data quality assessment
quality_check <- jsummarytools(
  data = clinical_data,
  analysis_type = "dfsummary",
  show_na_counts = TRUE,
  show_valid_counts = TRUE,
  max_distinct_values = 20
)

# Outlier detection through descriptive statistics
outlier_detection <- jsummarytools(
  data = clinical_data,
  analysis_type = "descr",
  vars = c("age", "bmi", "systolic_bp", "glucose_mg_dl", "cholesterol_mg_dl"),
  stats_to_include = "all"
)

# Completeness assessment
completeness_check <- clinical_data %>%
  summarise(
    total_observations = n(),
    complete_cases = sum(complete.cases(.)),
    completeness_rate = round(100 * complete_cases / total_observations, 1)
  )

print(completeness_check)

Summary

The jsummarytools function provides a comprehensive solution for descriptive statistics in clinical and survey research. Key advantages include:

Core Capabilities

Four Analysis Types: dfSummary for data exploration, freq for categorical analysis, descr for numeric summaries, ctable for cross-tabulations
Professional Output: HTML-formatted tables ready for reports and publications
Grouping Support: Stratified analyses for treatment comparisons and subgroup analyses
Missing Data Handling: Comprehensive reporting and appropriate handling of missing values

Performance Features

Intelligent Caching: Automatic caching of results for faster repeated analyses
Memory Efficiency: Optimized for large datasets with smart data preparation
Error Handling: Robust error handling for edge cases and unusual data
Validation: Comprehensive input validation and informative error messages

Customization Options

Output Formatting: Multiple styles including HTML, ASCII, and Bootstrap CSS
Statistical Control: Custom selection of statistics and precision control
Visual Elements: Optional graphs and charts in dfSummary output
Publication Ready: Professional formatting suitable for academic publications

Use Cases

Clinical Research: Baseline characteristics, outcome analysis, safety reporting
Survey Research: Demographics, satisfaction analysis, response patterns
Quality Control: Data validation, completeness assessment, outlier detection
Regulatory Submissions: Standardized statistical summaries for regulatory bodies

Next Steps

Explore different analysis types with your own data
Experiment with grouping variables for comparative analyses
Try different output styles for various reporting needs
Consider the summarytools package documentation for advanced features

sessionInfo()

Professional Data Summaries using summarytools Integration

ClinicoPath

last-modified

Introduction

Key Features

Installation and Setup

Data Preparation

Clinical Research Data

Survey Research Data

Basic Usage

Data Frame Summary (dfSummary)

Customized dfSummary

Grouped dfSummary

Frequency Tables (freq)

Frequency Tables with Options

Grouped Frequency Analysis

Descriptive Statistics (descr)

Custom Statistics Selection

Descriptive Statistics by Group

Different Statistics Sets

Cross-tabulation (ctable)

Cross-tabulation with Proportions

Weighted Cross-tabulation

Advanced Features

Output Styling and Formatting

Different Output Styles

HTML vs ASCII Output

Customization Options

Missing Data Analysis

Dataset with Missing Values

Missing Data in Different Analysis Types

Real-World Applications

Clinical Trial Analysis

Survey Research Analysis

Laboratory Data Analysis

Performance and Optimization

Caching Benefits

Memory Efficiency

Advanced Statistical Options

Comprehensive Statistics Selection

Advanced Cross-tabulation

Error Handling and Edge Cases

Small Sample Sizes

Single Variable Analysis

Unusual Data Distributions

Best Practices and Guidelines

Analysis Type Selection Guide

Statistical Interpretation Guidelines

Publication-Ready Tables

Integration with Research Workflows

Reproducible Research

Quality Control and Validation

Summary

Core Capabilities

Performance Features

Customization Options

Use Cases

Next Steps