
Introduction to Group and Summarize

The Group and Summarize module groups data by one or more categorical variables and calculates summary statistics for each group. It supports exploratory data analysis, reporting, and pattern discovery in complex datasets.

Key Features

  • Flexible Grouping: Group by one or multiple categorical variables
  • Multiple Statistics: Calculate sum, mean, median, and count for each group
  • Date Handling: Date parsing and time-based aggregation by hour, day, week, month, or year
  • Visual Analytics: Automatic charts, with bar charts for categorical groups and time series plots for date-aggregated data
  • Missing Data Management: Options for handling missing values
  • Export Ready: Professional tables and visualizations for reports

Applications

Business Analytics

  • Sales performance by region and product
  • Customer segmentation analysis
  • Financial reporting and KPI tracking

Clinical Research

  • Patient outcomes by treatment group
  • Multi-center study summaries
  • Longitudinal data analysis

Quality Control

  • Manufacturing metrics by shift and line
  • Defect analysis by production parameters
  • Performance monitoring over time

Getting Started

Required Data Structure

Your dataset should contain:

  1. Grouping Variables: Categorical variables to group by (factors, characters, or dates)
  2. Summary Variables: Numeric variables to calculate statistics for
  3. Optional Date Variables: For time-based aggregation and trend analysis

Example Datasets

Let’s explore the structure of our test datasets:

library(ClinicoPath)
library(dplyr)
library(ggplot2)

# Load example datasets
data("groupsummary_simple")
data("groupsummary_sales_data")
data("medical_research_data")

# Examine simple dataset structure
str(groupsummary_simple)
## 'data.frame':    100 obs. of  5 variables:
##  $ category : Factor w/ 3 levels "A","B","C": 1 2 2 2 3 1 3 3 3 2 ...
##  $ group    : Factor w/ 2 levels "Group1","Group2": 2 2 2 2 1 2 1 1 2 1 ...
##  $ value1   : num  100.2 99 122.5 116.7 96.2 ...
##  $ value2   : num  9.7 7.6 65.3 140.2 36.3 ...
##  $ count_var: int  8 4 8 9 8 6 1 2 4 7 ...
# Preview the first rows of the simple dataset
head(groupsummary_simple) %>%
  knitr::kable(caption = "Simple Grouping Dataset")
Simple Grouping Dataset

category  group   value1  value2  count_var
A         Group2   100.2     9.7          8
B         Group2    99.0     7.6          4
B         Group2   122.5    65.3          8
B         Group2   116.7   140.2          9
C         Group1    96.2    36.3          8
A         Group2   121.6    32.4          6

Basic Grouping and Summarization

Simple Single-Variable Grouping

Let’s start with basic grouping by a single categorical variable:

# Basic grouping by category
result_basic <- groupsummary(
  data = groupsummary_simple,
  groupVars = "category",
  sumVars = c("value1", "value2"),
  statistics = c("sum", "mean", "n")
)

# The result object contains the analysis
class(result_basic)
## [1] "groupsummaryResults" "Group"               "ResultsElement"     
## [4] "R6"
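
To make the idea concrete, the following dplyr sketch computes the same kind of per-group sum, mean, and count on a small made-up data frame. It is not the module's internal implementation; the `toy` data and the use of `na.rm = TRUE` are illustrative assumptions.

```r
library(dplyr)

# Hypothetical toy data, not one of the package datasets
toy <- data.frame(
  category = c("A", "A", "B", "B", "B", "C"),
  value1   = c(10, 20, 5, 15, 25, 40)
)

# One row per category with sum, mean, and count
toy_summary <- toy %>%
  group_by(category) %>%
  summarise(
    value1_sum  = sum(value1, na.rm = TRUE),
    value1_mean = mean(value1, na.rm = TRUE),
    n           = n(),
    .groups     = "drop"
  )

print(toy_summary)
```

Each grouping variable added to `group_by()` multiplies the number of rows in the result, which is the same effect as passing several names to `groupVars`.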

Multiple Statistics Calculation

Calculate different summary statistics for the same data:

# All available statistics
result_all_stats <- groupsummary(
  data = groupsummary_simple,
  groupVars = "category",
  sumVars = "value1",
  statistics = c("sum", "mean", "median", "n"),
  addPercentage = TRUE,
  sortBy = "first_desc"
)
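
The vignette does not spell out how `addPercentage` derives its values. One plausible reading, assumed here, is each group's share of the grand total. The sketch below shows that calculation with dplyr on made-up data; the column names are hypothetical.

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(
  category = c("A", "A", "B", "B", "C"),
  value1   = c(10, 30, 20, 20, 20)
)

# Group totals plus each group's percentage of the overall total
# (assumption: this is what a "percentage" column would report)
toy_pct <- toy %>%
  group_by(category) %>%
  summarise(value1_sum = sum(value1), .groups = "drop") %>%
  mutate(value1_pct = 100 * value1_sum / sum(value1_sum))

print(toy_pct)
```

Under this reading the percentages always add up to 100 across groups, which is a quick sanity check on any output table.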

Multi-Variable Grouping

Group by multiple categorical variables simultaneously:

# Group by category and group
result_multi <- groupsummary(
  data = groupsummary_simple,
  groupVars = c("category", "group"),
  sumVars = c("value1", "value2"),
  statistics = c("sum", "mean", "n"),
  sortBy = "first_desc"
)

Business Analytics Applications

Sales Performance Analysis

Analyze sales data by region and product category:

# Examine sales data structure
str(groupsummary_sales_data)
## 'data.frame':    300 obs. of  10 variables:
##  $ sale_date            : Date, format: "2023-02-04" "2022-02-14" ...
##  $ region               : Factor w/ 4 levels "East","North",..: 1 3 4 2 3 1 4 2 3 3 ...
##  $ product_category     : Factor w/ 4 levels "Clothing","Electronics",..: 3 1 2 2 1 3 4 2 1 2 ...
##  $ sales_channel        : Factor w/ 3 levels "Online","Phone",..: 1 1 1 3 3 1 1 1 2 1 ...
##  $ customer_type        : Factor w/ 3 levels "New","Returning",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ sales_amount         : num  713 287 163 237 810 ...
##  $ quantity_sold        : int  12 5 11 6 14 17 17 6 16 6 ...
##  $ profit_margin        : num  0.235 0.142 0.131 0.285 0.182 0.273 0.141 0.373 0.12 0.193 ...
##  $ discount_applied     : num  0.049 0.244 0.143 0.029 0.165 0.152 0.088 0.026 0.193 0.256 ...
##  $ customer_satisfaction: int  5 2 3 4 1 4 5 1 5 1 ...
# Sales by region and product category
result_sales <- groupsummary(
  data = groupsummary_sales_data,
  groupVars = c("region", "product_category"),
  sumVars = c("sales_amount", "quantity_sold"),
  statistics = c("sum", "mean", "median", "n"),
  addPercentage = TRUE,
  sortBy = "first_desc"
)

Customer Segmentation

Analyze customer behavior patterns:

# Customer analysis by channel and type
result_customer <- groupsummary(
  data = groupsummary_sales_data,
  groupVars = c("sales_channel", "customer_type"),
  sumVars = c("sales_amount", "profit_margin"),
  statistics = c("mean", "median", "n"),
  sortBy = "first_desc"
)

Regional Performance Comparison

Compare performance across different regions:

# Regional performance summary
result_regional <- groupsummary(
  data = groupsummary_sales_data,
  groupVars = "region",
  sumVars = c("sales_amount", "quantity_sold", "profit_margin"),
  statistics = c("sum", "mean", "n"),
  addPercentage = TRUE
)

Time-Based Analysis and Date Handling

Analyze sales trends over time using date aggregation:

# Monthly sales aggregation
result_monthly <- groupsummary(
  data = groupsummary_sales_data,
  groupVars = "sale_date",
  sumVars = c("sales_amount", "quantity_sold"),
  statistics = c("sum", "mean", "n"),
  dateVar = "sale_date",
  dateFormat = "ymd",
  timeAggregation = "month",
  sortBy = "groups"
)
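
Conceptually, time aggregation maps every date to the start of its period and then groups on that value. The sketch below illustrates this with `lubridate::floor_date()` and dplyr on made-up data. Whether the module uses the same approach internally is an assumption.

```r
library(dplyr)
library(lubridate)

# Hypothetical toy sales data
toy_sales <- data.frame(
  sale_date    = as.Date(c("2023-01-05", "2023-01-20", "2023-02-03",
                           "2023-02-28", "2023-03-15")),
  sales_amount = c(100, 150, 200, 50, 300)
)

# Map each date to the first day of its month, then summarise per month
monthly <- toy_sales %>%
  mutate(month = floor_date(sale_date, unit = "month")) %>%
  group_by(month) %>%
  summarise(
    sales_sum  = sum(sales_amount),
    sales_mean = mean(sales_amount),
    n          = n(),
    .groups    = "drop"
  )

print(monthly)
```

Changing `unit` to `"week"` or `"year"` gives the other aggregation levels in the same way.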

Weekly Performance Patterns

Examine weekly patterns in sales data:

# Weekly sales patterns
result_weekly <- groupsummary(
  data = groupsummary_sales_data,
  groupVars = "sale_date",
  sumVars = "sales_amount",
  statistics = c("sum", "n"),
  dateVar = "sale_date",
  dateFormat = "ymd",
  timeAggregation = "week"
)

Time Series with Grouping

Combine time aggregation with categorical grouping:

# Monthly sales by region
result_time_grouped <- groupsummary(
  data = groupsummary_sales_data,
  groupVars = c("sale_date", "region"),
  sumVars = "sales_amount",
  statistics = c("sum", "mean"),
  dateVar = "sale_date",
  dateFormat = "ymd",
  timeAggregation = "month"
)

Advanced Date and Time Handling

Financial Data with Timestamps

Work with precise timestamp data for financial analysis:

# Load financial data
data("groupsummary_financial_data")

# Examine timestamp structure
head(groupsummary_financial_data$transaction_time)
## [1] "2023-01-01 09:00:00 +03" "2023-01-01 10:00:00 +03"
## [3] "2023-01-01 11:00:00 +03" "2023-01-01 12:00:00 +03"
## [5] "2023-01-01 13:00:00 +03" "2023-01-01 14:00:00 +03"
# Hourly trading activity
result_hourly <- groupsummary(
  data = groupsummary_financial_data,
  groupVars = "transaction_time",
  sumVars = c("transaction_amount", "volume_traded"),
  statistics = c("sum", "mean", "n"),
  dateVar = "transaction_time",
  dateFormat = "ymd_hms",
  timeAggregation = "hour"
)
## Warning: 21 failed to parse.
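
The warning above means that some values could not be read with the `ymd_hms` format, so those rows end up without a usable timestamp. The vignette does not say which rows failed. One common cause, assumed here, is a timestamp stored as date-only text (for example a midnight value whose time part was dropped). The sketch shows how lubridate treats such values and how its `truncated` argument relaxes the parser.

```r
library(lubridate)

# Hypothetical timestamps; the second has no time component
stamps <- c("2023-01-01 23:00:00", "2023-01-02", "2023-01-02 01:00:00")

# Strict parsing: the date-only string cannot be parsed and becomes NA
strict <- suppressWarnings(ymd_hms(stamps))
n_failed <- sum(is.na(strict))

# truncated = 3 allows hour, minute, and second to be missing
lenient <- ymd_hms(stamps, truncated = 3)

print(n_failed)
print(lenient)
```

Counting `NA` values after parsing, as `n_failed` does, is a simple way to find out how many rows are affected before running the aggregation.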

Daily Market Analysis

Aggregate financial data by day and market sector:

# Daily market sector analysis
result_daily_market <- groupsummary(
  data = groupsummary_financial_data,
  groupVars = c("transaction_time", "market_sector"),
  sumVars = c("transaction_amount", "portfolio_value"),
  statistics = c("sum", "mean", "n"),
  dateVar = "transaction_time",
  dateFormat = "ymd_hms",
  timeAggregation = "day"
)
## Warning: 21 failed to parse.

Survey and Social Science Research

Demographic Analysis

Analyze survey responses by demographic groups:

# Load survey data
data("groupsummary_survey_data")

# Examine survey data structure
str(groupsummary_survey_data)
## 'data.frame':    250 obs. of  12 variables:
##  $ respondent_id       : chr  "R001" "R002" "R003" "R004" ...
##  $ survey_date         : Date, format: "2023-08-18" "2023-01-10" ...
##  $ age_group           : Ord.factor w/ 5 levels "18-25"<"26-35"<..: 4 1 4 5 5 2 4 3 5 2 ...
##  $ gender              : Factor w/ 3 levels "Female","Male",..: 2 2 1 1 1 1 1 1 1 1 ...
##  $ education           : Ord.factor w/ 3 levels "High School"<..: 2 2 3 3 1 1 1 1 1 3 ...
##  $ income_bracket      : Ord.factor w/ 3 levels "Low"<"Medium"<..: 2 2 1 1 2 2 1 2 2 1 ...
##  $ location            : Factor w/ 3 levels "Rural","Suburban",..: 3 1 1 2 2 1 1 3 3 3 ...
##  $ satisfaction_score  : int  2 1 3 8 6 5 7 4 6 6 ...
##  $ trust_score         : int  6 2 3 2 6 5 1 1 4 1 ...
##  $ likelihood_recommend: int  1 3 6 1 0 1 8 0 9 0 ...
##  $ time_spent_minutes  : num  12 40 9 6 38 4 3 20 33 23 ...
##  $ number_of_issues    : int  0 0 5 2 2 5 3 2 0 4 ...
# Satisfaction by age group and gender
result_demographics <- groupsummary(
  data = groupsummary_survey_data,
  groupVars = c("age_group", "gender"),
  sumVars = c("satisfaction_score", "trust_score", "likelihood_recommend"),
  statistics = c("mean", "median", "n"),
  sortBy = "first_desc"
)

Education Level Impact

Examine the relationship between education and survey responses:

# Education level analysis
result_education <- groupsummary(
  data = groupsummary_survey_data,
  groupVars = "education",
  sumVars = c("satisfaction_score", "trust_score", "time_spent_minutes"),
  statistics = c("mean", "median", "n"),
  sortBy = "first_desc"
)

Income Bracket Analysis

Analyze responses by income levels:

# Income bracket analysis
result_income <- groupsummary(
  data = groupsummary_survey_data,
  groupVars = c("income_bracket", "location"),
  sumVars = c("satisfaction_score", "likelihood_recommend"),
  statistics = c("mean", "n"),
  sortBy = "first_desc"
)

Manufacturing and Quality Control

Production Line Analysis

Analyze manufacturing data by production parameters:

# Load manufacturing data
data("groupsummary_manufacturing_data")

# Production line performance
result_production <- groupsummary(
  data = groupsummary_manufacturing_data,
  groupVars = c("production_line", "shift"),
  sumVars = c("units_produced", "defect_count", "efficiency_percent"),
  statistics = c("sum", "mean", "n"),
  sortBy = "first_desc"
)

Quality Analysis by Grade

Examine quality metrics across different grades:

# Quality grade analysis
result_quality <- groupsummary(
  data = groupsummary_manufacturing_data,
  groupVars = c("quality_grade", "operator_experience"),
  sumVars = c("units_produced", "defect_count", "cycle_time_minutes"),
  statistics = c("mean", "n"),
  sortBy = "first_desc"
)

Production Trends Over Time

Track production trends over time:

# Monthly production trends
result_production_trends <- groupsummary(
  data = groupsummary_manufacturing_data,
  groupVars = "production_date",
  sumVars = c("units_produced", "defect_count", "efficiency_percent"),
  statistics = c("sum", "mean"),
  dateVar = "production_date",
  dateFormat = "ymd",
  timeAggregation = "month"
)

Web Analytics and Digital Marketing

Traffic Source Analysis

Analyze website traffic by source and device:

# Load web analytics data
data("groupsummary_web_analytics")

# Traffic source performance
result_traffic <- groupsummary(
  data = groupsummary_web_analytics,
  groupVars = c("traffic_source", "device_type"),
  sumVars = c("page_views", "session_duration_minutes", "conversion_value"),
  statistics = c("sum", "mean", "n"),
  sortBy = "first_desc"
)

User Behavior Analysis

Examine user behavior patterns:

# User type and country analysis
result_users <- groupsummary(
  data = groupsummary_web_analytics,
  groupVars = c("user_type", "country"),
  sumVars = c("page_views", "session_duration_minutes", "bounce_rate_percent"),
  statistics = c("mean", "n"),
  sortBy = "first_desc"
)

Daily Traffic Patterns

Analyze daily traffic patterns:

# Daily traffic analysis
result_daily_traffic <- groupsummary(
  data = groupsummary_web_analytics,
  groupVars = "session_time",
  sumVars = c("page_views", "conversion_value"),
  statistics = c("sum", "mean", "n"),
  dateVar = "session_time",
  dateFormat = "ymd_hms",
  timeAggregation = "day"
)
## Warning: 25 failed to parse.

Clinical Research Applications

Multi-Center Study Analysis

Analyze clinical trial data across multiple centers:

# Examine medical research data
str(medical_research_data)
## 'data.frame':    890 obs. of  32 variables:
##  $ PatientID          : chr  "PT001" "PT001" "PT001" "PT002" ...
##  $ VisitNumber        : int  1 2 3 1 2 3 1 2 3 4 ...
##  $ VisitDate          : Date, format: "2022-01-01" "2022-02-09" ...
##  $ StudyCenter        : Factor w/ 5 levels "Boston Medical Center",..: 1 1 1 3 3 3 1 1 1 1 ...
##  $ AgeGroup           : Ord.factor w/ 5 levels "18-30"<"31-45"<..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ Gender             : Factor w/ 2 levels "Female","Male": 2 2 2 1 1 1 1 1 1 1 ...
##  $ TreatmentGroup     : Factor w/ 4 levels "Control","Treatment A",..: 2 2 2 3 3 3 2 2 2 2 ...
##  $ DiagnosisPrimary   : Factor w/ 5 levels "Chronic Kidney Disease",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ DiseaseStage       : Ord.factor w/ 4 levels "Stage I"<"Stage II"<..: 2 2 2 2 2 2 1 1 1 1 ...
##  $ ComorbidityCount   : int  3 3 3 3 3 3 2 2 2 2 ...
##  $ BMICategory        : Ord.factor w/ 4 levels "Underweight"<..: 2 2 2 3 3 3 2 2 2 2 ...
##  $ SystolicBP         : int  142 141 143 120 118 125 139 145 124 133 ...
##  $ DiastolicBP        : int  111 103 111 96 94 94 101 109 89 106 ...
##  $ HeartRate          : int  84 66 55 76 88 72 83 101 66 71 ...
##  $ Temperature        : num  36.6 37.9 36.7 36.8 37.5 36.7 36 36.5 36.8 36.8 ...
##  $ RespiratoryRate    : int  16 11 16 15 18 27 12 20 14 17 ...
##  $ OxygenSaturation   : int  98 98 94 98 100 98 100 96 96 98 ...
##  $ HbA1c              : num  NA 9.4 9 6.7 9.1 10.4 10.3 5.9 8.8 NA ...
##  $ Creatinine         : num  0.74 0.6 0.46 0.73 1.43 1.08 1.08 0.63 0.97 NA ...
##  $ eGFR               : int  89 85 92 89 80 81 84 NA 94 87 ...
##  $ Hemoglobin         : num  14.1 11.5 NA 10.6 13.2 13.6 12.8 NA 14.2 13.8 ...
##  $ WBC                : num  NA 10 9.1 7.5 5.7 NA NA NA 3.9 4.2 ...
##  $ Platelets          : int  238 225 295 254 297 181 186 141 NA 249 ...
##  $ TotalCholesterol   : int  130 NA 165 236 216 198 146 216 154 192 ...
##  $ LDLCholesterol     : int  95 136 132 145 136 NA 113 148 146 84 ...
##  $ HDLCholesterol     : int  41 62 50 36 64 NA 50 41 48 58 ...
##  $ Triglycerides      : int  148 117 53 226 104 108 103 163 199 NA ...
##  $ PainScore          : int  NA NA 1 4 7 2 4 4 3 NA ...
##  $ QualityOfLife      : int  41 58 75 53 64 51 65 61 68 87 ...
##  $ FunctionalStatus   : int  91 97 69 63 74 76 45 100 49 87 ...
##  $ MedicationAdherence: int  88 100 97 100 87 88 73 100 69 60 ...
##  $ ClinicalScore      : int  59 55 70 50 46 67 47 40 67 46 ...
# Multi-center treatment analysis
result_clinical <- groupsummary(
  data = medical_research_data,
  groupVars = c("StudyCenter", "TreatmentGroup"),
  sumVars = c("ClinicalScore", "SystolicBP", "HbA1c"),
  statistics = c("mean", "n"),
  showMissing = FALSE,
  sortBy = "first_desc"
)

Disease Stage Analysis

Examine outcomes by disease stage:

# Disease stage analysis
result_disease <- groupsummary(
  data = medical_research_data,
  groupVars = c("DiseaseStage", "TreatmentGroup"),
  sumVars = c("ClinicalScore", "PainScore", "QualityOfLife"),
  statistics = c("mean", "median", "n"),
  showMissing = FALSE
)

Longitudinal Visit Patterns

Analyze visit patterns over time:

# Monthly visit patterns
result_visits <- groupsummary(
  data = medical_research_data,
  groupVars = "VisitDate",
  sumVars = "ClinicalScore",
  statistics = c("n"),
  dateVar = "VisitDate",
  dateFormat = "ymd",
  timeAggregation = "month"
)

Hospital Hourly Monitoring

Analyze hourly patient monitoring data:

# Load hospital hourly data
data("hospital_admission_hourly")

# Hourly vital signs by department
result_hourly_vitals <- groupsummary(
  data = hospital_admission_hourly,
  groupVars = c("AdmissionTime", "Department"),
  sumVars = c("HeartRate", "SystolicBP", "OxygenSaturation", "PainScore"),
  statistics = c("mean", "n"),
  dateVar = "AdmissionTime",
  dateFormat = "ymd_hms",
  timeAggregation = "hour"
)
## Warning: 50 failed to parse.

Advanced Features and Customization

Handling Missing Values

Control how missing values are handled in your analysis:

# Create data with missing values for demonstration
test_data <- medical_research_data
test_data$ClinicalScore[1:10] <- NA

# Analysis excluding missing values
result_exclude_missing <- groupsummary(
  data = test_data,
  groupVars = "TreatmentGroup",
  sumVars = "ClinicalScore",
  statistics = c("mean", "n"),
  showMissing = FALSE
)

# Analysis including missing values
result_include_missing <- groupsummary(
  data = test_data,
  groupVars = "TreatmentGroup",
  sumVars = "ClinicalScore",
  statistics = c("mean", "n"),
  showMissing = TRUE
)
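
To see why missing values matter for group summaries, the sketch below counts missing values per group and contrasts a mean that skips them with one that does not. It uses dplyr on made-up data and is not a description of what `showMissing` does internally.

```r
library(dplyr)

# Hypothetical toy data with missing scores
toy <- data.frame(
  TreatmentGroup = c("Control", "Control", "Control",
                     "Treatment A", "Treatment A"),
  ClinicalScore  = c(50, NA, 70, 40, NA)
)

missing_by_group <- toy %>%
  group_by(TreatmentGroup) %>%
  summarise(
    n_total   = n(),                              # all rows in the group
    n_missing = sum(is.na(ClinicalScore)),        # rows without a score
    mean_obs  = mean(ClinicalScore, na.rm = TRUE),# mean of observed scores
    mean_raw  = mean(ClinicalScore),              # NA if any score is missing
    .groups   = "drop"
  )

print(missing_by_group)
```

Reporting `n_missing` next to the mean makes it clear how many observations each average actually rests on.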

Sorting and Display Options

Customize how results are sorted and displayed:

# Sort by group names (alphabetical)
result_sort_groups <- groupsummary(
  data = groupsummary_sales_data,
  groupVars = "region",
  sumVars = "sales_amount",
  statistics = c("sum", "mean"),
  sortBy = "groups"
)

# Sort by first summary variable (descending)
result_sort_desc <- groupsummary(
  data = groupsummary_sales_data,
  groupVars = "region",
  sumVars = "sales_amount",
  statistics = c("sum", "mean"),
  sortBy = "first_desc"
)

# Sort by first summary variable (ascending)
result_sort_asc <- groupsummary(
  data = groupsummary_sales_data,
  groupVars = "region",
  sumVars = "sales_amount",
  statistics = c("sum", "mean"),
  sortBy = "first_asc"
)
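
The three orderings correspond to ordinary sorts of the summary table. The sketch below reproduces them with `dplyr::arrange()` on a made-up table of regional totals; the mapping to the `sortBy` values is inferred from the comments in the calls above.

```r
library(dplyr)

# Hypothetical summary table
region_totals <- data.frame(
  region    = c("North", "East", "West", "South"),
  sales_sum = c(300, 120, 450, 210)
)

by_group <- region_totals %>% arrange(region)           # like "groups"
by_desc  <- region_totals %>% arrange(desc(sales_sum))  # like "first_desc"
by_asc   <- region_totals %>% arrange(sales_sum)        # like "first_asc"

print(by_desc)
```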

Percentage Calculations

Add percentage calculations to your summaries:

# Sales by region with percentages
result_percentages <- groupsummary(
  data = groupsummary_sales_data,
  groupVars = "region",
  sumVars = c("sales_amount", "quantity_sold"),
  statistics = c("sum", "mean", "n"),
  addPercentage = TRUE,
  sortBy = "first_desc"
)

Date Format and Time Aggregation Options

Multiple Date Formats

Tell the module how your dates are formatted through dateFormat. The example below uses year-month-day ("ymd") dates:

# Create test data with a Date column (year-month-day)
date_test_data <- data.frame(
  date_ymd = as.Date(c("2023-01-15", "2023-02-20", "2023-03-10", "2023-04-05")),
  value = c(100, 200, 150, 300),
  category = c("A", "B", "A", "B"),
  stringsAsFactors = FALSE
)

# YMD format analysis
result_ymd <- groupsummary(
  data = date_test_data,
  groupVars = "date_ymd",
  sumVars = "value",
  statistics = c("sum", "mean"),
  dateVar = "date_ymd",
  dateFormat = "ymd",
  timeAggregation = "month"
)
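
This vignette only demonstrates the `"ymd"` and `"ymd_hms"` values of `dateFormat`. If your dates arrive as text in another order, one option is to convert them to `Date` before the analysis. The sketch below does this with lubridate's `ymd()`, `dmy()`, and `mdy()` parsers; the input strings are made up.

```r
library(lubridate)

# The same calendar day written in three different orders
d_ymd <- ymd("2023-01-15")   # year-month-day
d_dmy <- dmy("15/01/2023")   # day-month-year
d_mdy <- mdy("01/15/2023")   # month-day-year

# All three are now Date objects representing the same day
print(c(d_ymd, d_dmy, d_mdy))
```

Once a column is a proper `Date`, it can be passed to the module in the same way as `date_ymd` above.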

Time Aggregation Levels

Explore different time aggregation options:

# Daily aggregation
result_daily <- groupsummary(
  data = groupsummary_sales_data,
  groupVars = "sale_date",
  sumVars = "sales_amount",
  statistics = c("sum"),
  dateVar = "sale_date",
  dateFormat = "ymd",
  timeAggregation = "day"
)

# Weekly aggregation
result_weekly_agg <- groupsummary(
  data = groupsummary_sales_data,
  groupVars = "sale_date",
  sumVars = "sales_amount",
  statistics = c("sum"),
  dateVar = "sale_date",
  dateFormat = "ymd",
  timeAggregation = "week"
)

# Yearly aggregation
result_yearly <- groupsummary(
  data = groupsummary_sales_data,
  groupVars = "sale_date",
  sumVars = "sales_amount",
  statistics = c("sum"),
  dateVar = "sale_date",
  dateFormat = "ymd",
  timeAggregation = "year"
)

Statistical Considerations

Understanding Summary Statistics

Sum

  • Use Case: Total values, cumulative metrics
  • Example: Total sales revenue, total production units
  • Interpretation: Represents the aggregate amount across all observations in each group

Mean

  • Use Case: Average performance, typical values
  • Example: Average customer satisfaction, mean response time
  • Interpretation: Central tendency that can be influenced by outliers

Median

  • Use Case: Robust central tendency, skewed distributions
  • Example: Median income, median processing time
  • Interpretation: Middle value that is less affected by extreme values

Count (n)

  • Use Case: Sample sizes, frequency analysis
  • Example: Number of customers, number of transactions
  • Interpretation: Shows the number of observations in each group
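
The difference between the mean and the median is easiest to see with a single extreme value. The base R sketch below uses made-up processing times.

```r
# Hypothetical processing times in minutes
times <- c(12, 14, 15, 13, 16)

# The same data with one extreme observation added
times_outlier <- c(times, 240)

mean(times)            # 14
median(times)          # 14

mean(times_outlier)    # pulled far upward by the single outlier
median(times_outlier)  # 14.5, barely changed
```

When the two statistics diverge like this, the median is usually the better description of a typical observation.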

Sample Size Considerations

Check group sizes before interpreting group-level statistics, because means and medians computed from very small groups are unstable:

# Analyze sample sizes across groups
sample_size_analysis <- groupsummary(
  data = medical_research_data,
  groupVars = c("StudyCenter", "TreatmentGroup"),
  sumVars = "ClinicalScore",
  statistics = c("n"),
  sortBy = "first_desc"
)
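
A simple follow-up is to flag groups that fall below a minimum size. The sketch below does this with dplyr on made-up data; the threshold `min_n` is a hypothetical value chosen for illustration, not a recommendation from the module.

```r
library(dplyr)

# Hypothetical toy data: 16 rows across two centers and two arms
toy <- data.frame(
  StudyCenter    = c(rep("Center 1", 12), rep("Center 2", 4)),
  TreatmentGroup = c(rep("Control", 6), rep("Treatment A", 6),
                     rep("Control", 3), "Treatment A")
)

min_n <- 5  # hypothetical minimum group size

group_sizes <- toy %>%
  count(StudyCenter, TreatmentGroup, name = "n") %>%
  mutate(small_group = n < min_n) %>%
  arrange(desc(n))

print(group_sizes)
```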

Dealing with Skewed Data

For skewed distributions, request both the mean and the median and compare them. A mean well above the median suggests right skew:

# Compare mean vs median for potentially skewed data
skewness_check <- groupsummary(
  data = groupsummary_financial_data,
  groupVars = "market_sector",
  sumVars = c("transaction_amount", "portfolio_value"),
  statistics = c("mean", "median", "n"),
  sortBy = "first_desc"
)

Visualization and Interpretation

Understanding the Generated Plots

The Group and Summarize module automatically generates appropriate visualizations based on your data:

Bar Charts

  • When: Categorical grouping variables
  • Shows: Comparison of summary statistics across groups
  • Best For: Regional comparisons, category analysis

Time Series Plots

  • When: Date variables with time aggregation
  • Shows: Trends over time with connecting lines
  • Best For: Temporal patterns, seasonal analysis
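
The module's own plotting code is not shown in this vignette. As a rough illustration of the two plot types, the ggplot2 sketch below draws a bar chart for categorical groups and a line plot for time-aggregated values. The data frames and axis labels are made up.

```r
library(ggplot2)

# Hypothetical summary tables
region_totals <- data.frame(
  region    = c("East", "North", "South", "West"),
  sales_sum = c(120, 300, 210, 450)
)
monthly_totals <- data.frame(
  month     = as.Date(c("2023-01-01", "2023-02-01", "2023-03-01")),
  sales_sum = c(250, 250, 300)
)

# Bar chart: one bar per group
bar_plot <- ggplot(region_totals, aes(x = region, y = sales_sum)) +
  geom_col() +
  labs(x = "Region", y = "Total sales")

# Time series plot: points connected by a line
trend_plot <- ggplot(monthly_totals, aes(x = month, y = sales_sum)) +
  geom_line() +
  geom_point() +
  labs(x = "Month", y = "Total sales")

print(bar_plot)
print(trend_plot)
```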

Reading the Visualizations

Key elements to focus on:

  1. Axis Labels: Understand what’s being measured
  2. Value Labels: Exact numbers on bars or points
  3. Scale: Notice the range and magnitude of differences
  4. Trends: Look for patterns, increases, or decreases

Interpretation Guidelines

Comparing Groups

  • Large Differences: May indicate substantial operational or strategic variation; confirm with a statistical test before treating a difference as statistically significant
  • Small Differences: Could suggest consistency or need for more targeted analysis
  • Outliers: Identify groups that perform unusually well or poorly

Time-Based Analysis

  • Trends: Increasing, decreasing, or stable patterns
  • Seasonality: Regular periodic patterns
  • Anomalies: Unusual spikes or drops that may need investigation

Best Practices and Recommendations

Data Preparation

Before Analysis

  1. Check Data Quality: Verify data types and handle missing values appropriately
  2. Validate Grouping Variables: Ensure categorical variables have meaningful levels
  3. Understand Your Data: Know the business context and expected patterns

Variable Selection

  1. Meaningful Groups: Choose grouping variables that align with business questions
  2. Relevant Metrics: Select summary variables that answer your research questions
  3. Appropriate Statistics: Choose statistics that make sense for your data type

Analysis Strategy

Start Simple

  1. Single Variable Grouping: Begin with one grouping variable
  2. Essential Statistics: Start with basic statistics (mean, count)
  3. Build Complexity: Gradually add more variables and statistics

Validate Results

  1. Sanity Checks: Verify results make business sense
  2. Cross-Validation: Check results against known benchmarks
  3. Sample Sizes: Ensure adequate sample sizes for reliable statistics

Reporting and Communication

Clear Presentation

  1. Descriptive Titles: Use clear, descriptive analysis titles
  2. Context: Provide background information for stakeholders
  3. Actionable Insights: Focus on findings that can drive decisions

Statistical Literacy

  1. Explain Methods: Help stakeholders understand the analysis approach
  2. Highlight Limitations: Discuss data limitations and caveats
  3. Provide Recommendations: Translate findings into actionable recommendations

Troubleshooting Common Issues

Data Type Issues

Factor vs Character Variables

# Ensure proper data types
groupsummary_sales_data$region <- as.factor(groupsummary_sales_data$region)
groupsummary_sales_data$product_category <- as.factor(groupsummary_sales_data$product_category)

Date Formatting Problems

# Check date formats before analysis
str(groupsummary_sales_data$sale_date)
##  Date[1:300], format: "2023-02-04" "2022-02-14" "2022-03-09" "2023-06-14" "2022-02-17" ...
# Convert if necessary
# groupsummary_sales_data$sale_date <- as.Date(groupsummary_sales_data$sale_date, format = "%Y-%m-%d")

Missing Data Strategies

Understanding Missing Patterns

# Check missing data patterns
missing_summary <- medical_research_data %>%
  summarise(across(everything(), ~ sum(is.na(.))))

print(missing_summary)
##   PatientID VisitNumber VisitDate StudyCenter AgeGroup Gender TreatmentGroup
## 1         0           0         0           0        0      0              0
##   DiagnosisPrimary DiseaseStage ComorbidityCount BMICategory SystolicBP
## 1                0            0                0           0          0
##   DiastolicBP HeartRate Temperature RespiratoryRate OxygenSaturation HbA1c
## 1           0         0           0               0                0   125
##   Creatinine eGFR Hemoglobin WBC Platelets TotalCholesterol LDLCholesterol
## 1        122  104        113 129       120              126            130
##   HDLCholesterol Triglycerides PainScore QualityOfLife FunctionalStatus
## 1            125           112        81            88               76
##   MedicationAdherence ClinicalScore
## 1                   0             0

Handling Missing Values

  1. Exclude Missing: Use showMissing = FALSE for complete case analysis
  2. Include Missing: Use showMissing = TRUE to see missing value patterns
  3. Imputation: Handle missing values before analysis if appropriate

Performance Considerations

Large Datasets

  • Consider filtering data to relevant time periods
  • Use appropriate date aggregation levels
  • Focus on key grouping variables

Memory Management

  • Process large datasets in chunks if necessary
  • Use efficient data types (factors for categories)
  • Remove unnecessary variables before analysis

Integration with Other Modules

Workflow Integration

The Group and Summarize module works well with other ClinicoPath modules:

Preprocessing

  1. Data Quality: Use data quality modules to clean data first
  2. Variable Creation: Create derived variables before grouping

Follow-up Analysis

  1. Statistical Testing: Use appropriate statistical tests for group comparisons
  2. Visualization: Create additional plots for detailed exploration
  3. Reporting: Export results for integration into reports

Export and Sharing

Table Export

  • Results tables can be copied and pasted into reports
  • Use professional formatting for presentations
  • Include metadata about analysis parameters

Visualization Export

  • Save plots for inclusion in presentations
  • Adjust plot dimensions for optimal display
  • Consider color-blind friendly palettes for wider accessibility
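
When working from R scripts rather than the graphical interface, a summary table and plot can also be written to disk. The sketch below uses base R's `write.csv()` and ggplot2's `ggsave()`. The table, file names, and dimensions are illustrative assumptions, and the files go to a temporary directory.

```r
library(ggplot2)

# Hypothetical summary table
summary_table <- data.frame(
  region    = c("East", "West"),
  sales_sum = c(120, 450)
)

out_dir  <- tempdir()
csv_path <- file.path(out_dir, "region_summary.csv")
png_path <- file.path(out_dir, "region_summary.png")

# Table export
write.csv(summary_table, csv_path, row.names = FALSE)

# Plot export with explicit dimensions (inches) and resolution
p <- ggplot(summary_table, aes(x = region, y = sales_sum)) +
  geom_col()
ggsave(png_path, plot = p, width = 6, height = 4, dpi = 300)
```

Setting `width`, `height`, and `dpi` explicitly keeps the exported figure consistent across machines.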

Advanced Use Cases

Quality Control Monitoring

Set up regular quality monitoring using groupsummary:

# Daily quality metrics
quality_monitoring <- groupsummary(
  data = groupsummary_manufacturing_data,
  groupVars = "production_date",
  sumVars = c("defect_count", "efficiency_percent"),
  statistics = c("mean", "sum"),
  dateVar = "production_date",
  dateFormat = "ymd",
  timeAggregation = "day"
)

Customer Segmentation

Create customer segments based on behavior:

# Customer value segmentation
customer_segments <- groupsummary(
  data = groupsummary_sales_data,
  groupVars = c("customer_type", "sales_channel"),
  sumVars = c("sales_amount", "profit_margin"),
  statistics = c("mean", "sum", "n"),
  addPercentage = TRUE,
  sortBy = "first_desc"
)

Performance Benchmarking

Compare performance across different dimensions:

# Regional performance benchmarking
regional_benchmark <- groupsummary(
  data = groupsummary_sales_data,
  groupVars = "region",
  sumVars = c("sales_amount", "profit_margin", "customer_satisfaction"),
  statistics = c("mean", "median", "n"),
  sortBy = "first_desc"
)

Summary

The Group and Summarize module provides a powerful and flexible tool for data aggregation and analysis. Key benefits include:

Analytical Power

  • Flexible Grouping: Support for single and multiple grouping variables
  • Multiple Statistics: Comprehensive statistical summaries
  • Time Intelligence: Advanced date handling and aggregation

Practical Applications

  • Business Analytics: Sales, customer, and performance analysis
  • Clinical Research: Multi-center studies and patient outcomes
  • Quality Control: Manufacturing and process monitoring
  • Digital Analytics: Web traffic and user behavior analysis

User-Friendly Features

  • Automatic Visualization: Context-appropriate charts and plots
  • Professional Output: Publication-ready tables and graphics
  • Flexible Options: Customizable sorting, formatting, and display

The module is a practical starting point for exploratory data analysis and reporting, helping users identify patterns, trends, and group differences across the domains shown above.

