Group and Summarize: Data Aggregation and Analysis
ClinicoPath
2025-07-13
Source: vignettes/clinicopath-descriptives-15-groupsummary-comprehensive.Rmd
Introduction to Group and Summarize
The Group and Summarize module groups data by one or more categorical variables and calculates summary statistics for each group. It is useful for exploratory data analysis, reporting, and finding patterns in complex datasets.
Key Features
- Flexible Grouping: Group by one or multiple categorical variables
- Multiple Statistics: Calculate sum, mean, median, and count for each group
- Date Handling: Advanced date parsing and time-based aggregation
- Visual Analytics: Automatic generation of appropriate charts and plots
- Missing Data Management: Options for handling missing values
- Export Ready: Professional tables and visualizations for reports
Getting Started
Required Data Structure
Your dataset should contain:
- Grouping Variables: Categorical variables to group by (factors, characters, or dates)
- Summary Variables: Numeric variables to calculate statistics for
- Optional Date Variables: For time-based aggregation and trend analysis
Example Datasets
Let’s explore the structure of our test datasets:
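# groupsummary() comes from ClinicoPath; %>% and summarise() come from dplyr
library(ClinicoPath)
library(dplyr)
library(ggplot2)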
# Load example datasets
data("groupsummary_simple")
data("groupsummary_sales_data")
data("medical_research_data")
# Examine simple dataset structure
str(groupsummary_simple)
## 'data.frame': 100 obs. of 5 variables:
## $ category : Factor w/ 3 levels "A","B","C": 1 2 2 2 3 1 3 3 3 2 ...
## $ group : Factor w/ 2 levels "Group1","Group2": 2 2 2 2 1 2 1 1 2 1 ...
## $ value1 : num 100.2 99 122.5 116.7 96.2 ...
## $ value2 : num 9.7 7.6 65.3 140.2 36.3 ...
## $ count_var: int 8 4 8 9 8 6 1 2 4 7 ...
# Preview different datasets
head(groupsummary_simple) %>%
  knitr::kable(caption = "Simple Grouping Dataset")
| category | group | value1 | value2 | count_var |
|---|---|---|---|---|
| A | Group2 | 100.2 | 9.7 | 8 |
| B | Group2 | 99.0 | 7.6 | 4 |
| B | Group2 | 122.5 | 65.3 | 8 |
| B | Group2 | 116.7 | 140.2 | 9 |
| C | Group1 | 96.2 | 36.3 | 8 |
| A | Group2 | 121.6 | 32.4 | 6 |
Basic Grouping and Summarization
Simple Single-Variable Grouping
Let’s start with basic grouping by a single categorical variable:
# Basic grouping by category
result_basic <- groupsummary(
data = groupsummary_simple,
groupVars = "category",
sumVars = c("value1", "value2"),
statistics = c("sum", "mean", "n")
)
# The result object contains the analysis
class(result_basic)
## [1] "groupsummaryResults" "Group" "ResultsElement"
## [4] "R6"
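As a cross-check on what a grouped summary contains, the same three statistics can be computed with base R. This is a minimal sketch on a small made-up data frame (`toy` and its values are illustrative assumptions, not the packaged dataset), and it does not reproduce the module's table formatting.

```r
# Hypothetical toy data; not the packaged groupsummary_simple dataset
toy <- data.frame(
  category = c("A", "A", "B", "B", "B", "C"),
  value1   = c(10, 20, 5, 15, 10, 40)
)

# Split the numeric column by group, then compute each statistic per group
by_group <- split(toy$value1, toy$category)
summary_tbl <- data.frame(
  category = names(by_group),
  sum      = vapply(by_group, sum, numeric(1)),
  mean     = vapply(by_group, mean, numeric(1)),
  n        = vapply(by_group, length, integer(1)),
  row.names = NULL
)
print(summary_tbl)
```

Each row of `summary_tbl` corresponds to one level of the grouping variable, which is the shape to expect from a single-variable grouping.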
Multiple Statistics Calculation
Calculate different summary statistics for the same data:
# All available statistics
result_all_stats <- groupsummary(
data = groupsummary_simple,
groupVars = "category",
sumVars = "value1",
statistics = c("sum", "mean", "median", "n"),
addPercentage = TRUE,
sortBy = "first_desc"
)
Multi-Variable Grouping
Group by multiple categorical variables simultaneously:
# Group by category and group
result_multi <- groupsummary(
data = groupsummary_simple,
groupVars = c("category", "group"),
sumVars = c("value1", "value2"),
statistics = c("sum", "mean", "n"),
sortBy = "first_desc"
)
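With two grouping variables, each observed combination of levels becomes its own group. The sketch below illustrates this with base R's `aggregate()` on made-up data (`toy` is a hypothetical data frame); it is meant as an illustration of the idea rather than a description of the module's internals.

```r
# Hypothetical toy data with two grouping columns and two numeric columns
toy <- data.frame(
  category = c("A", "A", "A", "B", "B", "B"),
  group    = c("Group1", "Group1", "Group2", "Group1", "Group2", "Group2"),
  value1   = c(10, 20, 30, 5, 15, 25),
  value2   = c(1, 3, 5, 2, 4, 6)
)

# One row per observed category x group combination, with the mean of each column
multi_tbl <- aggregate(cbind(value1, value2) ~ category + group,
                       data = toy, FUN = mean)
print(multi_tbl)
```

Combinations that never occur in the data do not appear in `multi_tbl`, so the number of rows can be smaller than the product of the level counts.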
Business Analytics Applications
Sales Performance Analysis
Analyze sales data by region and product category:
# Examine sales data structure
str(groupsummary_sales_data)
## 'data.frame': 300 obs. of 10 variables:
## $ sale_date : Date, format: "2023-02-04" "2022-02-14" ...
## $ region : Factor w/ 4 levels "East","North",..: 1 3 4 2 3 1 4 2 3 3 ...
## $ product_category : Factor w/ 4 levels "Clothing","Electronics",..: 3 1 2 2 1 3 4 2 1 2 ...
## $ sales_channel : Factor w/ 3 levels "Online","Phone",..: 1 1 1 3 3 1 1 1 2 1 ...
## $ customer_type : Factor w/ 3 levels "New","Returning",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ sales_amount : num 713 287 163 237 810 ...
## $ quantity_sold : int 12 5 11 6 14 17 17 6 16 6 ...
## $ profit_margin : num 0.235 0.142 0.131 0.285 0.182 0.273 0.141 0.373 0.12 0.193 ...
## $ discount_applied : num 0.049 0.244 0.143 0.029 0.165 0.152 0.088 0.026 0.193 0.256 ...
## $ customer_satisfaction: int 5 2 3 4 1 4 5 1 5 1 ...
# Sales by region and product category
result_sales <- groupsummary(
data = groupsummary_sales_data,
groupVars = c("region", "product_category"),
sumVars = c("sales_amount", "quantity_sold"),
statistics = c("sum", "mean", "median", "n"),
addPercentage = TRUE,
sortBy = "first_desc"
)
Customer Segmentation
Analyze customer behavior patterns:
# Customer analysis by channel and type
result_customer <- groupsummary(
data = groupsummary_sales_data,
groupVars = c("sales_channel", "customer_type"),
sumVars = c("sales_amount", "profit_margin"),
statistics = c("mean", "median", "n"),
sortBy = "first_desc"
)
Regional Performance Comparison
Compare performance across different regions:
# Regional performance summary
result_regional <- groupsummary(
data = groupsummary_sales_data,
groupVars = "region",
sumVars = c("sales_amount", "quantity_sold", "profit_margin"),
statistics = c("sum", "mean", "n"),
addPercentage = TRUE
)
Time-Based Analysis and Date Handling
Monthly Sales Trends
Analyze sales trends over time using date aggregation:
# Monthly sales aggregation
result_monthly <- groupsummary(
data = groupsummary_sales_data,
groupVars = "sale_date",
sumVars = c("sales_amount", "quantity_sold"),
statistics = c("sum", "mean", "n"),
dateVar = "sale_date",
dateFormat = "ymd",
timeAggregation = "month",
sortBy = "groups"
)
Weekly Performance Patterns
Examine weekly patterns in sales data:
# Weekly sales patterns
result_weekly <- groupsummary(
data = groupsummary_sales_data,
groupVars = "sale_date",
sumVars = "sales_amount",
statistics = c("sum", "n"),
dateVar = "sale_date",
dateFormat = "ymd",
timeAggregation = "week"
)
Time Series with Grouping
Combine time aggregation with categorical grouping:
# Monthly sales by region
result_time_grouped <- groupsummary(
data = groupsummary_sales_data,
groupVars = c("sale_date", "region"),
sumVars = "sales_amount",
statistics = c("sum", "mean"),
dateVar = "sale_date",
dateFormat = "ymd",
timeAggregation = "month"
)
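Conceptually, combining time aggregation with a categorical variable means collapsing each date to a period label and then grouping by that label together with the category. A minimal base R sketch, assuming monthly periods and a made-up `sales` data frame (not the packaged dataset):

```r
# Hypothetical sales records
sales <- data.frame(
  sale_date    = as.Date(c("2023-01-05", "2023-01-20", "2023-01-25",
                           "2023-02-03", "2023-02-10")),
  region       = c("East", "East", "West", "East", "West"),
  sales_amount = c(100, 150, 80, 200, 50)
)

# Collapse each date to a year-month label, then sum within month x region
sales$month <- format(sales$sale_date, "%Y-%m")
monthly_region <- aggregate(sales_amount ~ month + region,
                            data = sales, FUN = sum)
print(monthly_region)
```

Using a `"%Y-%m"` label rather than the month alone keeps January 2022 and January 2023 in separate groups when the data span several years.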
Advanced Date and Time Handling
Financial Data with Timestamps
Work with precise timestamp data for financial analysis:
# Load financial data
data("groupsummary_financial_data")
# Examine timestamp structure
head(groupsummary_financial_data$transaction_time)
## [1] "2023-01-01 09:00:00 +03" "2023-01-01 10:00:00 +03"
## [3] "2023-01-01 11:00:00 +03" "2023-01-01 12:00:00 +03"
## [5] "2023-01-01 13:00:00 +03" "2023-01-01 14:00:00 +03"
# Hourly trading activity
result_hourly <- groupsummary(
data = groupsummary_financial_data,
groupVars = "transaction_time",
sumVars = c("transaction_amount", "volume_traded"),
statistics = c("sum", "mean", "n"),
dateVar = "transaction_time",
dateFormat = "ymd_hms",
timeAggregation = "hour"
)
## Warning: 21 failed to parse.
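The warning reports how many timestamps could not be parsed with the requested format, but not which ones. One way to locate them is to parse the values as text with a strict format and inspect the `NA` results. The sketch below uses base R and made-up timestamps. Date-only strings, which can appear when midnight timestamps are converted to text, are one possible cause; whether that explains the failures reported here is an assumption.

```r
# Hypothetical timestamps as text; the third entry has no time component
stamps <- c("2023-01-01 22:00:00", "2023-01-01 23:00:00",
            "2023-01-02", "2023-01-02 01:00:00")

# Strict parse: anything that does not match the full format becomes NA
parsed <- as.POSIXct(stamps, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
failed <- stamps[is.na(parsed)]
print(failed)

# Fallback: retry the failures as date-only values (midnight is assumed)
retry <- as.POSIXct(failed, format = "%Y-%m-%d", tz = "UTC")
parsed[is.na(parsed)] <- retry
print(parsed)
```

Inspecting `failed` before choosing a fallback is worthwhile, because treating a malformed value as midnight would silently move it into the wrong hourly group.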
Daily Market Analysis
Aggregate financial data by day and market sector:
# Daily market sector analysis
result_daily_market <- groupsummary(
data = groupsummary_financial_data,
groupVars = c("transaction_time", "market_sector"),
sumVars = c("transaction_amount", "portfolio_value"),
statistics = c("sum", "mean", "n"),
dateVar = "transaction_time",
dateFormat = "ymd_hms",
timeAggregation = "day"
)
## Warning: 21 failed to parse.
Survey and Social Science Research
Demographic Analysis
Analyze survey responses by demographic groups:
# Load survey data
data("groupsummary_survey_data")
# Demographic summary
str(groupsummary_survey_data)
## 'data.frame': 250 obs. of 12 variables:
## $ respondent_id : chr "R001" "R002" "R003" "R004" ...
## $ survey_date : Date, format: "2023-08-18" "2023-01-10" ...
## $ age_group : Ord.factor w/ 5 levels "18-25"<"26-35"<..: 4 1 4 5 5 2 4 3 5 2 ...
## $ gender : Factor w/ 3 levels "Female","Male",..: 2 2 1 1 1 1 1 1 1 1 ...
## $ education : Ord.factor w/ 3 levels "High School"<..: 2 2 3 3 1 1 1 1 1 3 ...
## $ income_bracket : Ord.factor w/ 3 levels "Low"<"Medium"<..: 2 2 1 1 2 2 1 2 2 1 ...
## $ location : Factor w/ 3 levels "Rural","Suburban",..: 3 1 1 2 2 1 1 3 3 3 ...
## $ satisfaction_score : int 2 1 3 8 6 5 7 4 6 6 ...
## $ trust_score : int 6 2 3 2 6 5 1 1 4 1 ...
## $ likelihood_recommend: int 1 3 6 1 0 1 8 0 9 0 ...
## $ time_spent_minutes : num 12 40 9 6 38 4 3 20 33 23 ...
## $ number_of_issues : int 0 0 5 2 2 5 3 2 0 4 ...
# Satisfaction by age group and gender
result_demographics <- groupsummary(
data = groupsummary_survey_data,
groupVars = c("age_group", "gender"),
sumVars = c("satisfaction_score", "trust_score", "likelihood_recommend"),
statistics = c("mean", "median", "n"),
sortBy = "first_desc"
)
Education Level Impact
Examine the relationship between education and survey responses:
# Education level analysis
result_education <- groupsummary(
data = groupsummary_survey_data,
groupVars = "education",
sumVars = c("satisfaction_score", "trust_score", "time_spent_minutes"),
statistics = c("mean", "median", "n"),
sortBy = "first_desc"
)
Income Bracket Analysis
Analyze responses by income levels:
# Income bracket analysis
result_income <- groupsummary(
data = groupsummary_survey_data,
groupVars = c("income_bracket", "location"),
sumVars = c("satisfaction_score", "likelihood_recommend"),
statistics = c("mean", "n"),
sortBy = "first_desc"
)
Manufacturing and Quality Control
Production Line Analysis
Analyze manufacturing data by production parameters:
# Load manufacturing data
data("groupsummary_manufacturing_data")
# Production line performance
result_production <- groupsummary(
data = groupsummary_manufacturing_data,
groupVars = c("production_line", "shift"),
sumVars = c("units_produced", "defect_count", "efficiency_percent"),
statistics = c("sum", "mean", "n"),
sortBy = "first_desc"
)
Quality Analysis by Grade
Examine quality metrics across different grades:
# Quality grade analysis
result_quality <- groupsummary(
data = groupsummary_manufacturing_data,
groupVars = c("quality_grade", "operator_experience"),
sumVars = c("units_produced", "defect_count", "cycle_time_minutes"),
statistics = c("mean", "n"),
sortBy = "first_desc"
)
Monthly Production Trends
Track production trends over time:
# Monthly production trends
result_production_trends <- groupsummary(
data = groupsummary_manufacturing_data,
groupVars = "production_date",
sumVars = c("units_produced", "defect_count", "efficiency_percent"),
statistics = c("sum", "mean"),
dateVar = "production_date",
dateFormat = "ymd",
timeAggregation = "month"
)
Web Analytics and Digital Marketing
Traffic Source Analysis
Analyze website traffic by source and device:
# Load web analytics data
data("groupsummary_web_analytics")
# Traffic source performance
result_traffic <- groupsummary(
data = groupsummary_web_analytics,
groupVars = c("traffic_source", "device_type"),
sumVars = c("page_views", "session_duration_minutes", "conversion_value"),
statistics = c("sum", "mean", "n"),
sortBy = "first_desc"
)
User Behavior Analysis
Examine user behavior patterns:
# User type and country analysis
result_users <- groupsummary(
data = groupsummary_web_analytics,
groupVars = c("user_type", "country"),
sumVars = c("page_views", "session_duration_minutes", "bounce_rate_percent"),
statistics = c("mean", "n"),
sortBy = "first_desc"
)
Daily Traffic Patterns
Analyze daily traffic patterns:
# Daily traffic analysis
result_daily_traffic <- groupsummary(
data = groupsummary_web_analytics,
groupVars = "session_time",
sumVars = c("page_views", "conversion_value"),
statistics = c("sum", "mean", "n"),
dateVar = "session_time",
dateFormat = "ymd_hms",
timeAggregation = "day"
)
## Warning: 25 failed to parse.
Clinical Research Applications
Multi-Center Study Analysis
Analyze clinical trial data across multiple centers:
# Examine medical research data
str(medical_research_data)
## 'data.frame': 890 obs. of 32 variables:
## $ PatientID : chr "PT001" "PT001" "PT001" "PT002" ...
## $ VisitNumber : int 1 2 3 1 2 3 1 2 3 4 ...
## $ VisitDate : Date, format: "2022-01-01" "2022-02-09" ...
## $ StudyCenter : Factor w/ 5 levels "Boston Medical Center",..: 1 1 1 3 3 3 1 1 1 1 ...
## $ AgeGroup : Ord.factor w/ 5 levels "18-30"<"31-45"<..: 3 3 3 3 3 3 3 3 3 3 ...
## $ Gender : Factor w/ 2 levels "Female","Male": 2 2 2 1 1 1 1 1 1 1 ...
## $ TreatmentGroup : Factor w/ 4 levels "Control","Treatment A",..: 2 2 2 3 3 3 2 2 2 2 ...
## $ DiagnosisPrimary : Factor w/ 5 levels "Chronic Kidney Disease",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ DiseaseStage : Ord.factor w/ 4 levels "Stage I"<"Stage II"<..: 2 2 2 2 2 2 1 1 1 1 ...
## $ ComorbidityCount : int 3 3 3 3 3 3 2 2 2 2 ...
## $ BMICategory : Ord.factor w/ 4 levels "Underweight"<..: 2 2 2 3 3 3 2 2 2 2 ...
## $ SystolicBP : int 142 141 143 120 118 125 139 145 124 133 ...
## $ DiastolicBP : int 111 103 111 96 94 94 101 109 89 106 ...
## $ HeartRate : int 84 66 55 76 88 72 83 101 66 71 ...
## $ Temperature : num 36.6 37.9 36.7 36.8 37.5 36.7 36 36.5 36.8 36.8 ...
## $ RespiratoryRate : int 16 11 16 15 18 27 12 20 14 17 ...
## $ OxygenSaturation : int 98 98 94 98 100 98 100 96 96 98 ...
## $ HbA1c : num NA 9.4 9 6.7 9.1 10.4 10.3 5.9 8.8 NA ...
## $ Creatinine : num 0.74 0.6 0.46 0.73 1.43 1.08 1.08 0.63 0.97 NA ...
## $ eGFR : int 89 85 92 89 80 81 84 NA 94 87 ...
## $ Hemoglobin : num 14.1 11.5 NA 10.6 13.2 13.6 12.8 NA 14.2 13.8 ...
## $ WBC : num NA 10 9.1 7.5 5.7 NA NA NA 3.9 4.2 ...
## $ Platelets : int 238 225 295 254 297 181 186 141 NA 249 ...
## $ TotalCholesterol : int 130 NA 165 236 216 198 146 216 154 192 ...
## $ LDLCholesterol : int 95 136 132 145 136 NA 113 148 146 84 ...
## $ HDLCholesterol : int 41 62 50 36 64 NA 50 41 48 58 ...
## $ Triglycerides : int 148 117 53 226 104 108 103 163 199 NA ...
## $ PainScore : int NA NA 1 4 7 2 4 4 3 NA ...
## $ QualityOfLife : int 41 58 75 53 64 51 65 61 68 87 ...
## $ FunctionalStatus : int 91 97 69 63 74 76 45 100 49 87 ...
## $ MedicationAdherence: int 88 100 97 100 87 88 73 100 69 60 ...
## $ ClinicalScore : int 59 55 70 50 46 67 47 40 67 46 ...
# Multi-center treatment analysis
result_clinical <- groupsummary(
data = medical_research_data,
groupVars = c("StudyCenter", "TreatmentGroup"),
sumVars = c("ClinicalScore", "SystolicBP", "HbA1c"),
statistics = c("mean", "n"),
showMissing = FALSE,
sortBy = "first_desc"
)
Disease Stage Analysis
Examine outcomes by disease stage:
# Disease stage analysis
result_disease <- groupsummary(
data = medical_research_data,
groupVars = c("DiseaseStage", "TreatmentGroup"),
sumVars = c("ClinicalScore", "PainScore", "QualityOfLife"),
statistics = c("mean", "median", "n"),
showMissing = FALSE
)
Longitudinal Visit Patterns
Analyze visit patterns over time:
# Monthly visit patterns
result_visits <- groupsummary(
data = medical_research_data,
groupVars = "VisitDate",
sumVars = "ClinicalScore",
statistics = c("n"),
dateVar = "VisitDate",
dateFormat = "ymd",
timeAggregation = "month"
)
Hospital Hourly Monitoring
Analyze hourly patient monitoring data:
# Load hospital hourly data
data("hospital_admission_hourly")
# Hourly vital signs by department
result_hourly_vitals <- groupsummary(
data = hospital_admission_hourly,
groupVars = c("AdmissionTime", "Department"),
sumVars = c("HeartRate", "SystolicBP", "OxygenSaturation", "PainScore"),
statistics = c("mean", "n"),
dateVar = "AdmissionTime",
dateFormat = "ymd_hms",
timeAggregation = "hour"
)
## Warning: 50 failed to parse.
Advanced Features and Customization
Handling Missing Values
Control how missing values are handled in your analysis:
# Create data with missing values for demonstration
test_data <- medical_research_data
test_data$ClinicalScore[1:10] <- NA
# Analysis excluding missing values
result_exclude_missing <- groupsummary(
data = test_data,
groupVars = "TreatmentGroup",
sumVars = "ClinicalScore",
statistics = c("mean", "n"),
showMissing = FALSE
)
# Analysis including missing values
result_include_missing <- groupsummary(
data = test_data,
groupVars = "TreatmentGroup",
sumVars = "ClinicalScore",
statistics = c("mean", "n"),
showMissing = TRUE
)
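To see why the treatment of missing values matters, it helps to compare group means computed with and without `NA` values. This base R sketch uses a made-up `scores` data frame; how `showMissing` maps onto these two behaviours inside the module is not confirmed here and should be treated as an assumption.

```r
# Hypothetical scores with one missing value in the Control group
scores <- data.frame(
  TreatmentGroup = c("Control", "Control", "Control", "Treatment A", "Treatment A"),
  ClinicalScore  = c(50, NA, 70, 40, 60)
)

by_group <- split(scores$ClinicalScore, scores$TreatmentGroup)
missing_tbl <- data.frame(
  TreatmentGroup = names(by_group),
  # The mean is NA whenever the group contains a missing value
  mean_keep_na = vapply(by_group, mean, numeric(1)),
  # Dropping NA values first gives the mean of the observed values only
  mean_drop_na = vapply(by_group, mean, numeric(1), na.rm = TRUE),
  n_total      = vapply(by_group, length, integer(1)),
  n_missing    = vapply(by_group, function(x) sum(is.na(x)), integer(1)),
  row.names = NULL
)
print(missing_tbl)
```

Reporting `n_missing` next to the mean makes it clear how many observations each summary is actually based on.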
Sorting and Display Options
Customize how results are sorted and displayed:
# Sort by group names (alphabetical)
result_sort_groups <- groupsummary(
data = groupsummary_sales_data,
groupVars = "region",
sumVars = "sales_amount",
statistics = c("sum", "mean"),
sortBy = "groups"
)
# Sort by first summary variable (descending)
result_sort_desc <- groupsummary(
data = groupsummary_sales_data,
groupVars = "region",
sumVars = "sales_amount",
statistics = c("sum", "mean"),
sortBy = "first_desc"
)
# Sort by first summary variable (ascending)
result_sort_asc <- groupsummary(
data = groupsummary_sales_data,
groupVars = "region",
sumVars = "sales_amount",
statistics = c("sum", "mean"),
sortBy = "first_asc"
)
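The three sort orders can be pictured as three different row orderings of the same summary table. A small base R sketch on a made-up `regional` table (the values are illustrative, and the column name `sales_sum` is hypothetical):

```r
# Hypothetical summary table: one row per region
regional <- data.frame(
  region    = c("North", "East", "West", "South"),
  sales_sum = c(300, 120, 450, 200)
)

# Alphabetical by group name, in the spirit of sortBy = "groups"
by_groups <- regional[order(regional$region), ]

# By the first summary column, in the spirit of "first_desc" and "first_asc"
by_first_desc <- regional[order(regional$sales_sum, decreasing = TRUE), ]
by_first_asc  <- regional[order(regional$sales_sum), ]

print(by_first_desc)
```

Descending order puts the largest groups first, which suits ranking reports; alphabetical order is easier to scan when looking up a specific group.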
Percentage Calculations
Add percentage calculations to your summaries:
# Sales by region with percentages
result_percentages <- groupsummary(
data = groupsummary_sales_data,
groupVars = "region",
sumVars = c("sales_amount", "quantity_sold"),
statistics = c("sum", "mean", "n"),
addPercentage = TRUE,
sortBy = "first_desc"
)
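A natural reading of a percentage column is each group's share of the overall total. The sketch below computes that share in base R on a made-up table; it assumes this interpretation of `addPercentage`, which the text above does not spell out.

```r
# Hypothetical regional totals that add up to 1000
regional <- data.frame(
  region    = c("North", "East", "West", "South"),
  sales_sum = c(300, 120, 450, 130)
)

# Share of the grand total, expressed as a percentage with one decimal place
regional$sales_pct <- round(100 * regional$sales_sum / sum(regional$sales_sum), 1)
print(regional)
```

Shares computed this way sum to 100 (up to rounding), which makes them a quick sanity check on the totals.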
Date Format and Time Aggregation Options
Multiple Date Formats
Handle different date formats in your data:
# Create test data with different date formats
date_test_data <- data.frame(
date_ymd = as.Date(c("2023-01-15", "2023-02-20", "2023-03-10", "2023-04-05")),
value = c(100, 200, 150, 300),
category = c("A", "B", "A", "B"),
stringsAsFactors = FALSE
)
# YMD format analysis
result_ymd <- groupsummary(
data = date_test_data,
groupVars = "date_ymd",
sumVars = "value",
statistics = c("sum", "mean"),
dateVar = "date_ymd",
dateFormat = "ymd",
timeAggregation = "month"
)
Time Aggregation Levels
Explore different time aggregation options:
# Daily aggregation
result_daily <- groupsummary(
data = groupsummary_sales_data,
groupVars = "sale_date",
sumVars = "sales_amount",
statistics = c("sum"),
dateVar = "sale_date",
dateFormat = "ymd",
timeAggregation = "day"
)
# Weekly aggregation
result_weekly_agg <- groupsummary(
data = groupsummary_sales_data,
groupVars = "sale_date",
sumVars = "sales_amount",
statistics = c("sum"),
dateVar = "sale_date",
dateFormat = "ymd",
timeAggregation = "week"
)
# Yearly aggregation
result_yearly <- groupsummary(
data = groupsummary_sales_data,
groupVars = "sale_date",
sumVars = "sales_amount",
statistics = c("sum"),
dateVar = "sale_date",
dateFormat = "ymd",
timeAggregation = "year"
)
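The aggregation level determines how many dates fall into each group. A minimal base R sketch, using made-up dates and `format()` labels to stand in for the day, week, month, and year levels (ISO 8601 weeks via `%G` and `%V` are an assumption here; the module may define weeks differently):

```r
# Hypothetical daily amounts spanning a year boundary
dates  <- as.Date(c("2022-12-30", "2023-01-02", "2023-01-03",
                    "2023-01-15", "2023-02-01"))
amount <- c(10, 20, 30, 40, 50)

# Coarser labels merge more dates into the same group
by_day   <- tapply(amount, format(dates, "%Y-%m-%d"), sum)
by_week  <- tapply(amount, format(dates, "%G-W%V"), sum)  # ISO week-based year and week
by_month <- tapply(amount, format(dates, "%Y-%m"), sum)
by_year  <- tapply(amount, format(dates, "%Y"), sum)

print(by_month)
print(by_year)
```

The grand total is the same at every level; only the number of groups changes, from five daily groups down to two yearly ones in this example.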
Statistical Considerations
Understanding Summary Statistics
Sum
- Use Case: Total values, cumulative metrics
- Example: Total sales revenue, total production units
- Interpretation: Represents the aggregate amount across all observations in each group
Mean
- Use Case: Average performance, typical values
- Example: Average customer satisfaction, mean response time
- Interpretation: Central tendency that can be influenced by outliers
Sample Size Considerations
Check group sizes before interpreting summaries, because statistics from small groups are unstable:
# Analyze sample sizes across groups
sample_size_analysis <- groupsummary(
data = medical_research_data,
groupVars = c("StudyCenter", "TreatmentGroup"),
sumVars = "ClinicalScore",
statistics = c("n"),
sortBy = "first_desc"
)
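Counting observations per group also reveals combinations that are small or entirely empty. The base R sketch below uses made-up data and a hypothetical threshold `min_n`; the cutoff is an illustrative assumption, not a recommendation from the module.

```r
# Hypothetical visits: Center 2 has no Treatment observations
visits <- data.frame(
  StudyCenter    = c("Center 1", "Center 1", "Center 1", "Center 1",
                     "Center 2", "Center 2"),
  TreatmentGroup = c("Control", "Control", "Control", "Treatment",
                     "Control", "Control")
)

# Cross-tabulate so that empty combinations show up with a count of zero
counts <- as.data.frame(table(visits$StudyCenter, visits$TreatmentGroup),
                        responseName = "n")
names(counts)[1:2] <- c("StudyCenter", "TreatmentGroup")

min_n <- 3  # hypothetical minimum group size
counts$small <- counts$n < min_n
print(counts)
```

Unlike a grouped summary, which lists only observed combinations, the cross-tabulation includes the empty Center 2 / Treatment cell.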
Dealing with Skewed Data
For skewed distributions, compare the mean with the median; a large gap between them suggests outliers or skew:
# Compare mean vs median for potentially skewed data
skewness_check <- groupsummary(
data = groupsummary_financial_data,
groupVars = "market_sector",
sumVars = c("transaction_amount", "portfolio_value"),
statistics = c("mean", "median", "n"),
sortBy = "first_desc"
)
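The effect of a single outlier on the mean, and its lack of effect on the median, is easy to see on a small made-up example. The two vectors below are illustrative assumptions and differ only in their last value.

```r
# Hypothetical transaction amounts for two sectors
sector_a <- c(100, 110, 120, 130, 140)   # roughly symmetric
sector_b <- c(100, 110, 120, 130, 5000)  # one extreme value

comparison <- data.frame(
  sector = c("A", "B"),
  mean   = c(mean(sector_a), mean(sector_b)),
  median = c(median(sector_a), median(sector_b))
)
# A ratio well above 1 indicates the mean is being pulled up by large values
comparison$mean_to_median <- comparison$mean / comparison$median
print(comparison)
```

Both sectors share the same median, but the outlier raises sector B's mean by roughly a factor of nine, which is why the median is the safer summary for skewed amounts.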
Visualization and Interpretation
Understanding the Generated Plots
The Group and Summarize module automatically generates appropriate visualizations based on your data:
Bar Charts
- When: Categorical grouping variables
- Shows: Comparison of summary statistics across groups
- Best For: Regional comparisons, category analysis
Best Practices and Recommendations
Data Preparation
Analysis Strategy
Troubleshooting Common Issues
Data Type Issues
Date Formatting Problems
# Check date formats before analysis
str(groupsummary_sales_data$sale_date)
## Date[1:300], format: "2023-02-04" "2022-02-14" "2022-03-09" "2023-06-14" "2022-02-17" ...
# Convert if necessary
# groupsummary_sales_data$sale_date <- as.Date(groupsummary_sales_data$sale_date, format = "%Y-%m-%d")
Missing Data Strategies
Understanding Missing Patterns
# Check missing data patterns
missing_summary <- medical_research_data %>%
  summarise(across(everything(), ~ sum(is.na(.))))
print(missing_summary)
## PatientID VisitNumber VisitDate StudyCenter AgeGroup Gender TreatmentGroup
## 1 0 0 0 0 0 0 0
## DiagnosisPrimary DiseaseStage ComorbidityCount BMICategory SystolicBP
## 1 0 0 0 0 0
## DiastolicBP HeartRate Temperature RespiratoryRate OxygenSaturation HbA1c
## 1 0 0 0 0 0 125
## Creatinine eGFR Hemoglobin WBC Platelets TotalCholesterol LDLCholesterol
## 1 122 104 113 129 120 126 130
## HDLCholesterol Triglycerides PainScore QualityOfLife FunctionalStatus
## 1 125 112 81 88 76
## MedicationAdherence ClinicalScore
## 1 0 0
Integration with Other Modules
Workflow Integration
The Group and Summarize module works well with other ClinicoPath modules:
Advanced Use Cases
Quality Control Monitoring
Set up regular quality monitoring using groupsummary:
# Daily quality metrics
quality_monitoring <- groupsummary(
data = groupsummary_manufacturing_data,
groupVars = "production_date",
sumVars = c("defect_count", "efficiency_percent"),
statistics = c("mean", "sum"),
dateVar = "production_date",
dateFormat = "ymd",
timeAggregation = "day"
)
Customer Segmentation
Create customer segments based on behavior:
# Customer value segmentation
customer_segments <- groupsummary(
data = groupsummary_sales_data,
groupVars = c("customer_type", "sales_channel"),
sumVars = c("sales_amount", "profit_margin"),
statistics = c("mean", "sum", "n"),
addPercentage = TRUE,
sortBy = "first_desc"
)
Performance Benchmarking
Compare performance across different dimensions:
# Regional performance benchmarking
regional_benchmark <- groupsummary(
data = groupsummary_sales_data,
groupVars = "region",
sumVars = c("sales_amount", "profit_margin", "customer_satisfaction"),
statistics = c("mean", "median", "n"),
sortBy = "first_desc"
)
Summary
The Group and Summarize module aggregates data by group and reports summary statistics for each group. Key benefits include:
Analytical Power
- Flexible Grouping: Support for single and multiple grouping variables
- Multiple Statistics: Comprehensive statistical summaries
- Time Intelligence: Advanced date handling and aggregation
Practical Applications
- Business Analytics: Sales, customer, and performance analysis
- Clinical Research: Multi-center studies and patient outcomes
- Quality Control: Manufacturing and process monitoring
- Digital Analytics: Web traffic and user behavior analysis
User-Friendly Features
- Automatic Visualization: Context-appropriate charts and plots
- Professional Output: Publication-ready tables and graphics
- Flexible Options: Customizable sorting, formatting, and display
These features make the module a practical starting point for exploratory data analysis and reporting, helping users see patterns, trends, and relationships in their data across the domains shown above.