This project was developed with the assistance of Claude, an AI assistant by Anthropic.
A comprehensive metadata extraction toolkit for digital pathology research, designed to standardize cohort definition and image analysis workflows in QuPath.
The QuPath Cohort Metadata Extractor is a robust workflow designed for pathologists and researchers who need to systematically analyze large collections of whole slide images (WSI). It automatically extracts comprehensive metadata from images in QuPath projects, enabling efficient cohort definition, quality control, and standardized analysis workflows.
```shell
git clone https://github.com/sbalci/metadata-qupath.git
cd metadata-qupath
```
Copy the scripts to your QuPath scripts directory:

- **Windows:** `%USERPROFILE%\.qupath\scripts\`
- **macOS / Linux:** `~/.qupath/scripts/`
- `QuPathCohortExtractor.groovy` - Full-featured version
- `SimpleMetadataExtractor.groovy` - Lightweight version for testing
- `QuPath_v06_Compatible.groovy` - Optimized for QuPath 0.6+
- `MenuSetup.groovy` - Menu integration; copy to your QuPath scripts directory, then run:

```groovy
runScript(new File(QPEx.getQuPathUserDirectory(), "scripts/MenuSetup.groovy"))
```
- **Menu-integrated version:** navigate to *Analyze > Cohort Analysis > Extract Cohort Metadata*.
- **Direct script execution:** run the `QuPath_v06_Compatible.groovy` script.
Results are saved to a `cohort_metadata/` directory within your project folder. Open `cohort_metadata_v06.csv` in Excel or your preferred analysis tool.

```groovy
// Analyze the currently open image
def projectEntry = QPEx.getProjectEntry()
def extractor = new CohortMetadataExtractor(projectEntry)
def metadata = extractor.extractMetadata()
println("Metadata extracted: ${metadata.size()} fields")
```
```groovy
// Load exported metadata
def cohortData = CohortUtils.loadCohortMetadata("cohort_metadata_v06.csv")

// Filter high-quality images
def highQualityImages = CohortUtils.filterImages(cohortData, [
    has_pyramid: true,
    scan_warning: "NONE",
    estimated_magnification: 40
])
println("Found ${highQualityImages.size()} high-quality 40x images")
```
`cohort_metadata_v06.csv` contains 50+ columns of metadata, including:

**Image properties**

| Field | Description | Example |
|-------|-------------|---------|
| `image_name` | Filename of the image | kontrol15.01.25_14_6_134952.svs |
| `width_pixels` | Image width in pixels | 47622 |
| `height_pixels` | Image height in pixels | 63413 |
| `pixel_width_um` | Pixel size in micrometers | 0.263312 |
| `estimated_magnification` | Calculated magnification | 40 |
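The `estimated_magnification` field is calculated from `pixel_width_um`. As a rough sketch only (the toolkit's actual heuristic is not shown here), a common convention assumes about 0.25 µm/pixel at 40x and snaps the result to the nearest standard objective:

```python
def estimate_magnification(pixel_width_um, um_per_px_at_40x=0.25):
    """Estimate objective magnification from pixel size.

    Assumes ~0.25 um/pixel at 40x (a common convention; the extractor's
    actual heuristic may differ) and snaps to standard objectives.
    """
    raw = 40 * um_per_px_at_40x / pixel_width_um
    standard = [2.5, 5, 10, 20, 40, 80]
    return min(standard, key=lambda m: abs(m - raw))

# The example row above (0.263312 um/pixel) maps to 40x.
print(estimate_magnification(0.263312))  # -> 40
```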
**Scanner information**

| Field | Description | Example |
|-------|-------------|---------|
| `scanner_type` | Scanner model | GT450 |
| `scanscope_id` | Scanner identifier | 1111111 |
| `scan_date` | Date of image acquisition | 01/07/2025 |
| `scan_time` | Time of image acquisition | 08:29:16 |
| `apparent_magnification` | Scanner-reported magnification | 40X |
**Quality indicators**

| Field | Description | Example |
|-------|-------------|---------|
| `has_pyramid` | Whether the image has a pyramid structure | true |
| `scan_warning` | Any scanner warnings | NONE |
| `compression_quality` | JPEG compression quality | 91 |
| `file_size_mb` | File size in megabytes | 563.87 |
**Analysis guidance**

| Field | Description | Example |
|-------|-------------|---------|
| `suggested_analysis_level` | Optimal pyramid level for analysis | 1 |
| `needs_pyramid` | Whether the image needs a pyramid for performance | false |
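The `suggested_analysis_level` field picks the pyramid level whose effective pixel size is closest to a target working resolution. A minimal sketch, assuming each level downsamples by 2x (consistent with the `pixelSize * 2^level` formula used in the metadata-driven analysis example later in this README) and an illustrative target of 0.5 µm/pixel:

```python
def suggested_level(pixel_width_um, target_um=0.5, max_level=4):
    """Pick the pyramid level whose effective pixel size is nearest target_um.

    Assumes each level downsamples by 2x (pixel size * 2**level); the
    0.5 um/pixel target is illustrative, not the toolkit's actual value.
    """
    return min(range(max_level + 1),
               key=lambda lvl: abs(pixel_width_um * 2**lvl - target_um))

# A 0.263 um/pixel (40x) scan lands on level 1 (~0.53 um/pixel),
# matching the example value in the table above.
print(suggested_level(0.263312))  # -> 1
```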
Additional outputs:

- `detailed_summary_v06.txt`: Human-readable summary with statistics
- `processing_log.txt`: Detailed processing log with any errors

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load cohort data
df = pd.read_csv('cohort_metadata_v06.csv')

# Basic statistics
print(f"Total images: {len(df)}")
print(f"Scanners: {df['scanner_type'].unique()}")
print(f"Date range: {df['scan_date'].min()} to {df['scan_date'].max()}")

# Quality assessment
quality_issues = df[
    (df['scan_warning'] != 'NONE') |
    (df['compression_quality'] < 85) |
    (~df['has_pyramid'])
]
print(f"Images with quality concerns: {len(quality_issues)}")

# Magnification distribution
df['estimated_magnification'].hist(bins=20)
plt.title('Magnification Distribution')
plt.xlabel('Magnification')
plt.ylabel('Number of Images')
plt.show()
```
```r
library(dplyr)
library(ggplot2)

# Load data
cohort_data <- read.csv("cohort_metadata_v06.csv")

# Scanner analysis
scanner_summary <- cohort_data %>%
  group_by(scanner_type, scan_date) %>%
  summarise(
    image_count = n(),
    avg_file_size = mean(file_size_mb, na.rm = TRUE),
    .groups = 'drop'
  )

# Visualization
ggplot(cohort_data, aes(x = pixel_width_um, y = estimated_magnification)) +
  geom_point(aes(color = scanner_type)) +
  labs(title = "Pixel Size vs Magnification by Scanner",
       x = "Pixel Width (µm)", y = "Estimated Magnification")
```
Add custom extraction logic by extending the `CohortMetadataExtractor` class:

```groovy
class CustomExtractor extends CohortMetadataExtractor {
    def extractStainInfo() {
        // Custom stain detection logic
        if (metadata.description?.toLowerCase()?.contains('he')) {
            metadata.stain_type = 'H&E'
        }
    }
}
```
```groovy
// Use metadata to set analysis parameters
def cohortData = CohortUtils.loadCohortMetadata("cohort_metadata_v06.csv")
def currentImage = cohortData.find {
    it.image_name == getCurrentImageData().getServer().getMetadata().get('Name')
}

if (currentImage) {
    def analysisLevel = currentImage.suggested_analysis_level
    def pixelSize = currentImage.pixel_width_um

    // Configure your analysis based on metadata
    println("Using analysis level: ${analysisLevel}")
    println("Target pixel size: ${pixelSize * Math.pow(2, analysisLevel)} µm")
}
```
- **Issue:** "No signature of method `getImageType()`". **Fix:** use `QuPath_v06_Compatible.groovy` for QuPath 0.6+.
- **Issue:** CSV file has only 4 columns.
- **Issue:** "Could not load server" errors.
- **Issue:** Missing scanner metadata.
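A quick way to spot the "only 4 columns" symptom is to validate the CSV header before analysis. This is a sanity-check sketch, not part of the toolkit; the `check_cohort_csv` helper is hypothetical and the required column names are taken from the field tables above:

```python
import csv

def check_cohort_csv(path, required=("image_name", "width_pixels",
                                     "pixel_width_um", "scanner_type",
                                     "has_pyramid")):
    """Return the required columns missing from the CSV header.

    Hypothetical sanity check (not part of the toolkit); column names
    come from the field tables earlier in this README.
    """
    with open(path, newline="") as f:
        header = next(csv.reader(f))
    return [col for col in required if col not in header]

# Example: a truncated 4-column export reports its missing fields.
with open("demo_metadata.csv", "w", newline="") as f:
    f.write("image_name,width_pixels,height_pixels,file_size_mb\n")
print(check_cohort_csv("demo_metadata.csv"))
# -> ['pixel_width_um', 'scanner_type', 'has_pyramid']
```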
We welcome contributions from the digital pathology community!
```shell
git checkout -b feature/amazing-feature
git commit -m 'Add amazing feature'
git push origin feature/amazing-feature
```
If you use this workflow in your research, please cite:
```bibtex
@software{qupath_cohort_extractor,
  title={QuPath Cohort Metadata Extractor},
  author={[Your Name/Institution]},
  year={2025},
  url={https://github.com/sbalci/qupath-cohort-extractor}
}
```
This project is licensed under the MIT License - see the LICENSE file for details.
Made with ❤️ for the digital pathology community

Star ⭐ this repository if you find it useful!