High-Performance Scatter Plots with jscattermore
Comprehensive Guide to Fast Visualization of Large Datasets
ClinicoPath
last-modified
Source:vignettes/jjstatsplot-31-jscattermore-comprehensive.Rmd
jjstatsplot-31-jscattermore-comprehensive.Rmd
Introduction
The jscattermore
function provides high-performance
scatter plot visualization using the scattermore
package,
designed specifically for plotting millions of data points efficiently.
This function is ideal for large datasets where traditional plotting
methods become slow or memory-intensive.
Key Features
- High Performance: Efficiently plot millions of points using raster graphics
- Multiple Backends: Support for base R, ggplot2, and optimized ggplot2 rendering
- Advanced Aesthetics: Color and size mapping with multiple palette options
- Statistical Overlays: Correlation coefficients, smooth lines, and density contours
- Customization: Extensive theming and styling options
- Performance Monitoring: Built-in performance measurement and comparison
- Memory Efficient: Optimized memory usage through intelligent caching
Installation and Setup
# Load required libraries
library(ClinicoPath)
## Warning: replacing previous import 'dplyr::as_data_frame' by
## 'igraph::as_data_frame' when loading 'ClinicoPath'
## Warning: replacing previous import 'DiagrammeR::count_automorphisms' by
## 'igraph::count_automorphisms' when loading 'ClinicoPath'
## Warning: replacing previous import 'dplyr::groups' by 'igraph::groups' when
## loading 'ClinicoPath'
## Warning: replacing previous import 'DiagrammeR::get_edge_ids' by
## 'igraph::get_edge_ids' when loading 'ClinicoPath'
## Warning: replacing previous import 'dplyr::union' by 'igraph::union' when
## loading 'ClinicoPath'
## Warning: replacing previous import 'dplyr::select' by 'jmvcore::select' when
## loading 'ClinicoPath'
## Warning: replacing previous import 'igraph::union' by 'lubridate::union' when
## loading 'ClinicoPath'
## Warning: replacing previous import 'igraph::%--%' by 'lubridate::%--%' when
## loading 'ClinicoPath'
## Warning: replacing previous import 'cutpointr::tnr' by 'mlr3measures::tnr' when
## loading 'ClinicoPath'
## Warning: replacing previous import 'cutpointr::precision' by
## 'mlr3measures::precision' when loading 'ClinicoPath'
## Warning: replacing previous import 'cutpointr::tn' by 'mlr3measures::tn' when
## loading 'ClinicoPath'
## Warning: replacing previous import 'cutpointr::fnr' by 'mlr3measures::fnr' when
## loading 'ClinicoPath'
## Warning: replacing previous import 'cutpointr::tp' by 'mlr3measures::tp' when
## loading 'ClinicoPath'
## Warning: replacing previous import 'cutpointr::npv' by 'mlr3measures::npv' when
## loading 'ClinicoPath'
## Warning: replacing previous import 'cutpointr::ppv' by 'mlr3measures::ppv' when
## loading 'ClinicoPath'
## Warning: replacing previous import 'cutpointr::auc' by 'mlr3measures::auc' when
## loading 'ClinicoPath'
## Warning: replacing previous import 'cutpointr::tpr' by 'mlr3measures::tpr' when
## loading 'ClinicoPath'
## Warning: replacing previous import 'cutpointr::fn' by 'mlr3measures::fn' when
## loading 'ClinicoPath'
## Warning: replacing previous import 'cutpointr::fp' by 'mlr3measures::fp' when
## loading 'ClinicoPath'
## Warning: replacing previous import 'cutpointr::fpr' by 'mlr3measures::fpr' when
## loading 'ClinicoPath'
## Warning: replacing previous import 'cutpointr::recall' by
## 'mlr3measures::recall' when loading 'ClinicoPath'
## Warning: replacing previous import 'cutpointr::specificity' by
## 'mlr3measures::specificity' when loading 'ClinicoPath'
## Warning: replacing previous import 'cutpointr::sensitivity' by
## 'mlr3measures::sensitivity' when loading 'ClinicoPath'
## Warning: replacing previous import 'igraph::as_data_frame' by
## 'tibble::as_data_frame' when loading 'ClinicoPath'
## Warning: replacing previous import 'igraph::crossing' by 'tidyr::crossing' when
## loading 'ClinicoPath'
## Warning: replacing previous import 'magrittr::extract' by 'tidyr::extract' when
## loading 'ClinicoPath'
## Warning: replacing previous import 'mlr3measures::sensitivity' by
## 'caret::sensitivity' when loading 'ClinicoPath'
## Warning: replacing previous import 'mlr3measures::specificity' by
## 'caret::specificity' when loading 'ClinicoPath'
## Registered S3 methods overwritten by 'useful':
## method from
## autoplot.acf ggfortify
## fortify.acf ggfortify
## fortify.kmeans ggfortify
## fortify.ts ggfortify
## Warning: replacing previous import 'jmvcore::select' by 'dplyr::select' when
## loading 'ClinicoPath'
## Registered S3 methods overwritten by 'ggpp':
## method from
## heightDetails.titleGrob ggplot2
## widthDetails.titleGrob ggplot2
## Warning: replacing previous import 'DataExplorer::plot_histogram' by
## 'grafify::plot_histogram' when loading 'ClinicoPath'
## Warning: replacing previous import 'dplyr::select' by 'jmvcore::select' when
## loading 'ClinicoPath'
## Warning: replacing previous import 'mlr3measures::auc' by 'pROC::auc' when
## loading 'ClinicoPath'
## Warning: replacing previous import 'cutpointr::roc' by 'pROC::roc' when loading
## 'ClinicoPath'
## Warning: replacing previous import 'tibble::view' by 'summarytools::view' when
## loading 'ClinicoPath'
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
# Set options for better output
options(digits = 3)
knitr::opts_chunk$set(
fig.width = 10,
fig.height = 6,
dpi = 300,
out.width = "100%",
echo = TRUE,
eval = FALSE
)
# Check if scattermore is available
if (!requireNamespace("scattermore", quietly = TRUE)) {
message("Note: scattermore package not available. Install with: install.packages('scattermore')")
}
Basic Usage
Simple Scatter Plot
The most basic usage requires only x and y variables:
# Create sample data
set.seed(123)
basic_data <- data.frame(
x = rnorm(5000),
y = rnorm(5000) + 0.5 * rnorm(5000),
group = factor(sample(c("A", "B", "C"), 5000, replace = TRUE))
)
# Basic scatter plot
result_basic <- jscattermore(
data = basic_data,
x_var = "x",
y_var = "y",
facet_var = NULL,
color_var = NULL,
size_var = NULL
)
# The result contains the plot and summary information
print(names(result_basic))
Plot Types
Base R Implementation
result_base <- jscattermore(
data = basic_data,
x_var = "x",
y_var = "y",
plot_type = "base_r",
show_correlation = TRUE,
plot_title = "Base R Scatter Plot"
)
ggplot2 Implementation
result_ggplot <- jscattermore(
data = basic_data,
x_var = "x",
y_var = "y",
plot_type = "ggplot2",
show_correlation = TRUE,
plot_title = "ggplot2 Scatter Plot"
)
Optimized ggplot2 Implementation
result_opt <- jscattermore(
data = basic_data,
x_var = "x",
y_var = "y",
plot_type = "ggplot2_opt",
show_correlation = TRUE,
plot_title = "Optimized ggplot2 Scatter Plot"
)
Advanced Aesthetics
Color Mapping
Continuous Color Variables
# Add continuous color variable
continuous_data <- basic_data %>%
mutate(intensity = x^2 + y^2 + rnorm(nrow(basic_data), sd = 0.1))
result_cont_color <- jscattermore(
data = continuous_data,
x_var = "x",
y_var = "y",
color_var = "intensity",
color_palette = "viridis",
plot_title = "Continuous Color Mapping"
)
Categorical Color Variables
result_cat_color <- jscattermore(
data = basic_data,
x_var = "x",
y_var = "y",
color_var = "group",
color_palette = "plasma",
plot_title = "Categorical Color Mapping"
)
Color Palettes
Comparing Different Palettes
palettes <- c("viridis", "plasma", "inferno", "magma", "cividis")
# Create a comparison (showing code for one palette)
result_palette <- jscattermore(
data = continuous_data,
x_var = "x",
y_var = "y",
color_var = "intensity",
color_palette = "inferno",
plot_title = "Inferno Color Palette"
)
Statistical Overlays
Correlation Display
# Data with strong correlation
corr_data <- data.frame(
x = rnorm(2000),
y = NA
)
corr_data$y <- 0.8 * corr_data$x + rnorm(2000, sd = 0.5)
result_corr <- jscattermore(
data = corr_data,
x_var = "x",
y_var = "y",
show_correlation = TRUE,
plot_title = "Scatter Plot with Correlation"
)
Smooth Lines
# Linear smooth
result_smooth_lm <- jscattermore(
data = corr_data,
x_var = "x",
y_var = "y",
show_smooth = TRUE,
smooth_method = "lm",
show_correlation = TRUE,
plot_title = "Linear Smooth Line"
)
# LOESS smooth
result_smooth_loess <- jscattermore(
data = corr_data,
x_var = "x",
y_var = "y",
show_smooth = TRUE,
smooth_method = "loess",
show_correlation = TRUE,
plot_title = "LOESS Smooth Line"
)
Density Contours
result_density <- jscattermore(
data = basic_data,
x_var = "x",
y_var = "y",
show_density = TRUE,
alpha = 0.6,
plot_title = "Scatter Plot with Density Contours"
)
Data Transformations
Log Transformations
# Create positive data for log transformation
log_data <- data.frame(
x = exp(rnorm(3000, mean = 2, sd = 1)),
y = exp(rnorm(3000, mean = 1.5, sd = 0.8))
)
# Log transform X-axis
result_log_x <- jscattermore(
data = log_data,
x_var = "x",
y_var = "y",
log_transform_x = TRUE,
plot_title = "Log-transformed X-axis"
)
# Log transform both axes
result_log_both <- jscattermore(
data = log_data,
x_var = "x",
y_var = "y",
log_transform_x = TRUE,
log_transform_y = TRUE,
plot_title = "Log-log Transformation"
)
Faceting and Grouping
Faceted Plots
# Multi-group data for faceting
facet_data <- data.frame(
x = rnorm(1500),
y = rnorm(1500),
condition = factor(rep(c("Control", "Treatment_A", "Treatment_B"), each = 500)),
time_point = factor(rep(c("Baseline", "Week_4", "Week_8"), times = 500))
)
result_facet <- jscattermore(
data = facet_data,
x_var = "x",
y_var = "y",
facet_var = "condition",
color_var = "time_point",
plot_title = "Faceted Scatter Plot by Condition"
)
Performance Optimization
Performance Monitoring
# Large dataset for performance testing
large_data <- data.frame(
x = rnorm(20000),
y = rnorm(20000) + 0.3 * rnorm(20000),
group = factor(sample(c("Group1", "Group2", "Group3"), 20000, replace = TRUE))
)
result_performance <- jscattermore(
data = large_data,
x_var = "x",
y_var = "y",
color_var = "group",
show_performance = TRUE,
plot_title = "Performance Monitoring Example"
)
Raster Settings
Resolution Control
# High resolution for publication quality
result_high_res <- jscattermore(
data = large_data,
x_var = "x",
y_var = "y",
pixels = 1024, # High resolution
interpolate = TRUE,
plot_title = "High Resolution (1024x1024)"
)
# Lower resolution for faster rendering
result_low_res <- jscattermore(
data = large_data,
x_var = "x",
y_var = "y",
pixels = 256, # Lower resolution
interpolate = FALSE,
plot_title = "Lower Resolution (256x256)"
)
Point Size Optimization
# Automatic point sizing
result_auto_size <- jscattermore(
data = large_data,
x_var = "x",
y_var = "y",
pointsize = 0, # Automatic
plot_title = "Automatic Point Sizing"
)
# Manual point sizing
result_manual_size <- jscattermore(
data = large_data,
x_var = "x",
y_var = "y",
pointsize = 2, # Manual
plot_title = "Manual Point Sizing"
)
Customization and Theming
Visual Themes
# Different themes
theme_data <- basic_data[1:1000, ]
# Minimal theme
result_minimal <- jscattermore(
data = theme_data,
x_var = "x",
y_var = "y",
color_var = "group",
theme_style = "minimal",
plot_title = "Minimal Theme"
)
# Dark theme
result_dark <- jscattermore(
data = theme_data,
x_var = "x",
y_var = "y",
color_var = "group",
theme_style = "dark",
plot_title = "Dark Theme"
)
Custom Labels and Titles
result_labels <- jscattermore(
data = basic_data,
x_var = "x",
y_var = "y",
color_var = "group",
x_label = "Predictor Variable (X)",
y_label = "Response Variable (Y)",
plot_title = "Custom Scatter Plot with Detailed Labels"
)
Alpha and Transparency
# Different alpha levels for overplotting
alpha_data <- data.frame(
x = rnorm(8000),
y = rnorm(8000)
)
# High alpha (less transparent)
result_alpha_high <- jscattermore(
data = alpha_data,
x_var = "x",
y_var = "y",
alpha = 0.9,
plot_title = "High Alpha (α = 0.9)"
)
# Low alpha (more transparent)
result_alpha_low <- jscattermore(
data = alpha_data,
x_var = "x",
y_var = "y",
alpha = 0.3,
plot_title = "Low Alpha (α = 0.3)"
)
Real-World Applications
Medical Research Example
# Simulate medical research data
set.seed(456)
medical_data <- data.frame(
patient_id = 1:2000,
age = round(rnorm(2000, mean = 55, sd = 15)),
bmi = rnorm(2000, mean = 26, sd = 4),
systolic_bp = round(rnorm(2000, mean = 130, sd = 20)),
cholesterol = rnorm(2000, mean = 200, sd = 30),
treatment = factor(sample(c("Control", "Drug_A", "Drug_B"), 2000, replace = TRUE)),
outcome_score = rnorm(2000, mean = 50, sd = 15)
) %>%
# Create realistic relationships
mutate(
systolic_bp = systolic_bp + 0.5 * age + 1.5 * bmi,
cholesterol = cholesterol + 0.3 * age + 2 * bmi,
outcome_score = outcome_score - 0.2 * age + 0.1 * cholesterol
)
# BMI vs Blood Pressure colored by treatment
result_medical <- jscattermore(
data = medical_data,
x_var = "bmi",
y_var = "systolic_bp",
color_var = "treatment",
size_var = "age",
show_smooth = TRUE,
smooth_method = "lm",
show_correlation = TRUE,
x_label = "Body Mass Index (BMI)",
y_label = "Systolic Blood Pressure (mmHg)",
plot_title = "BMI vs Blood Pressure by Treatment Group"
)
Time Series Analysis
# Time series scatter plot
time_data <- data.frame(
time = 1:1000,
signal = sin(seq(0, 4*pi, length.out = 1000)) + rnorm(1000, sd = 0.2),
noise = rnorm(1000),
phase = factor(rep(c("Phase1", "Phase2", "Phase3", "Phase4"), each = 250))
)
result_timeseries <- jscattermore(
data = time_data,
x_var = "time",
y_var = "signal",
color_var = "phase",
show_smooth = TRUE,
smooth_method = "loess",
x_label = "Time Points",
y_label = "Signal Amplitude",
plot_title = "Time Series Signal Analysis"
)
High-Dimensional Data Visualization
# Simulate high-dimensional data with clustering
set.seed(789)
n_points <- 3000
cluster_centers <- rbind(
c(0, 0), c(3, 3), c(-2, 4), c(4, -1)
)
high_dim_data <- do.call(rbind, lapply(1:4, function(i) {
n_cluster <- n_points / 4
data.frame(
x = rnorm(n_cluster, cluster_centers[i, 1], 1),
y = rnorm(n_cluster, cluster_centers[i, 2], 1),
cluster = factor(paste("Cluster", i)),
density = runif(n_cluster, 0.1, 1.0),
dimension_3 = rnorm(n_cluster, cluster_centers[i, 1] + cluster_centers[i, 2], 0.5)
)
}))
result_clusters <- jscattermore(
data = high_dim_data,
x_var = "x",
y_var = "y",
color_var = "dimension_3",
size_var = "density",
facet_var = "cluster",
color_palette = "viridis",
plot_title = "High-Dimensional Clustering Visualization"
)
Performance Comparison
Benchmarking Different Methods
# Large dataset for benchmarking
benchmark_data <- data.frame(
x = rnorm(50000),
y = rnorm(50000),
group = factor(sample(c("A", "B", "C", "D"), 50000, replace = TRUE))
)
# Benchmark different plot types with performance monitoring
benchmark_results <- list()
# Base R method
benchmark_results$base_r <- jscattermore(
data = benchmark_data,
x_var = "x",
y_var = "y",
plot_type = "base_r",
show_performance = TRUE,
plot_title = "Base R Performance Test"
)
# ggplot2 method
benchmark_results$ggplot2 <- jscattermore(
data = benchmark_data,
x_var = "x",
y_var = "y",
plot_type = "ggplot2",
show_performance = TRUE,
plot_title = "ggplot2 Performance Test"
)
# Optimized ggplot2 method
benchmark_results$ggplot2_opt <- jscattermore(
data = benchmark_data,
x_var = "x",
y_var = "y",
plot_type = "ggplot2_opt",
show_performance = TRUE,
plot_title = "Optimized ggplot2 Performance Test"
)
Error Handling and Edge Cases
Missing Data
# Data with missing values
missing_data <- basic_data
missing_data$x[sample(1:nrow(missing_data), 200)] <- NA
missing_data$y[sample(1:nrow(missing_data), 150)] <- NA
result_missing <- jscattermore(
data = missing_data,
x_var = "x",
y_var = "y",
color_var = "group",
plot_title = "Handling Missing Data"
)
Single Data Points
# Edge case: single data point
single_data <- data.frame(x = 1, y = 1)
result_single <- jscattermore(
data = single_data,
x_var = "x",
y_var = "y",
plot_title = "Single Data Point"
)
Identical Values
# Edge case: all identical values
identical_data <- data.frame(
x = rep(5, 100),
y = rep(5, 100)
)
result_identical <- jscattermore(
data = identical_data,
x_var = "x",
y_var = "y",
plot_title = "Identical Values"
)
Best Practices
1. Choosing Plot Types
- base_r: Best for simple, fast plots without ggplot2 dependencies
- ggplot2: Standard ggplot2 integration with full theming support
- ggplot2_opt: Most optimized for very large datasets
2. Performance Optimization
# For very large datasets
performance_tips <- data.frame(
tip = c(
"Use lower pixel resolution for interactive exploration",
"Use higher resolution for final publication plots",
"Enable interpolation for smoother appearance",
"Disable correlation calculation for huge datasets",
"Use appropriate alpha values to handle overplotting",
"Consider sampling large datasets for initial exploration"
),
code_example = c(
"pixels = 256",
"pixels = 1024",
"interpolate = TRUE",
"show_correlation = FALSE",
"alpha = 0.3",
"sample_n(data, 10000)"
)
)
print(performance_tips)
3. Color and Aesthetics
# Best practices for color mapping
aesthetic_guidelines <- data.frame(
aspect = c(
"Continuous variables",
"Categorical variables",
"Large datasets",
"Publication plots",
"Accessibility"
),
recommendation = c(
"Use perceptually uniform palettes (viridis, plasma)",
"Use qualitatively different colors",
"Use transparency (alpha < 1) to show density",
"Use high resolution and interpolation",
"Avoid red-green combinations, use colorblind-friendly palettes"
),
palette_suggestion = c(
"viridis, plasma, inferno",
"Set1, Dark2, Accent",
"Any with alpha = 0.3-0.7",
"viridis with interpolate = TRUE",
"viridis, cividis"
)
)
print(aesthetic_guidelines)
Summary
The jscattermore
function provides a powerful and
efficient solution for visualizing large datasets through
high-performance scatter plots. Key advantages include:
Performance Benefits
- Speed: 10-100x faster than traditional plotting for large datasets
- Memory Efficiency: Raster-based rendering reduces memory usage
- Scalability: Can handle millions of data points
- Caching: Intelligent caching system reduces redundant computations
Visualization Features
- Multiple Backends: Choose optimal rendering method for your needs
- Rich Aesthetics: Color and size mapping with multiple palette options
- Statistical Overlays: Correlation, smooth lines, density contours
- Customization: Extensive theming and styling options
Use Cases
- Exploratory Data Analysis: Quick visualization of large datasets
- Time Series Analysis: Efficient plotting of temporal data
- Clustering Visualization: Color-coded cluster analysis
- Medical Research: Patient data visualization with multiple variables
- Performance Monitoring: Built-in benchmarking capabilities
The function is designed to handle real-world scenarios including missing data, edge cases, and performance requirements while maintaining high visual quality and statistical accuracy.
Next Steps
- Explore the ClinicoPath package documentation for more statistical analysis functions
- Try the function with your own large datasets
- Experiment with different plot types and aesthetic options
- Use performance monitoring to optimize your workflows