Patient Similarity Clustering — patientsimilarity • ClinicoPath

Visualizes patient similarity using dimensionality reduction techniques (PCA, t-SNE, UMAP, MDS). Projects high-dimensional patient data into 2D or 3D space to reveal natural patient groupings and subpopulations. Inspired by Orange Data Mining's interactive projection widgets, adapted for jamovi with comprehensive cluster analysis and statistical validation.
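
As a rough illustration of what such a projection does, the base-R sketch below standardizes the four numeric iris columns and projects them to 2D with PCA (prcomp) and classical MDS (cmdscale). It is not the package's internal code; the dataset and object names are assumptions chosen for illustration. On Euclidean distances, classical MDS recovers the PCA scores up to a sign flip per axis, which the last lines check.

```r
# Illustrative only: base-R projection, not the ClinicoPath implementation.
X <- scale(iris[, 1:4])                 # standardize to mean 0, sd 1

# PCA: keep the first two principal component scores
pca <- prcomp(X, center = FALSE, scale. = FALSE)
pca_xy <- pca$x[, 1:2]

# Share of total variance captured by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(var_explained[1:2], 3)

# Classical MDS on Euclidean distances between patients (rows)
mds_xy <- cmdscale(dist(X), k = 2)

# The two embeddings agree up to the sign of each axis
max_diff <- max(abs(abs(pca_xy) - abs(mds_xy)))
max_diff < 1e-6
```

t-SNE and UMAP have no such closed-form counterpart, so their coordinates depend on the random seed and on the tuning arguments described below.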

Usage

patientsimilarity(
  data,
  vars = NULL,
  method = "tsne",
  dimensions = "2",
  colorBy = NULL,
  perplexity = 30,
  iterations = 1000,
  umapNeighbors = 15,
  umapMinDist = 0.1,
  performClustering = FALSE,
  clusterMethod = "kmeans",
  nClusters = 3,
  showClusterStats = TRUE,
  survivalAnalysis = FALSE,
  survivalTime = NULL,
  survivalEvent = NULL,
  survivalEventLevel,
  scaleVars = TRUE,
  removeOutliers = FALSE,
  showLoadings = FALSE,
  show3DPlot = FALSE
)

Arguments

data: The dataset to be analyzed, provided as a data frame.
vars: Continuous variables to use for calculating patient similarity. These will be used to compute distances between patients. Categorical variables should be converted to numeric or one-hot encoded.
method: Method for dimensionality reduction:
  - PCA: Linear method; preserves global structure.
  - t-SNE: Non-linear; well suited to visualization; preserves local structure.
  - UMAP: Non-linear; preserves both local and global structure; typically faster than t-SNE.
  - MDS: Classical method; preserves pairwise distances.
dimensions: Number of dimensions for the projection, passed as a string ("2" by default). 2D is easier to interpret; 3D can reveal additional structure.
colorBy: Variable to use for coloring points. Typically an outcome variable (e.g., disease status, survival, response) to see if it corresponds to natural patient groupings.
perplexity: Perplexity parameter for t-SNE. Roughly corresponds to the number of nearest neighbors considered. Typical values: 5-50. Higher values preserve more global structure.
iterations: Number of iterations for t-SNE optimization. More iterations improve convergence but take longer.
umapNeighbors: Number of nearest neighbors for UMAP. Controls the balance between local and global structure: smaller values emphasize local structure, larger values emphasize global structure.
umapMinDist: Minimum distance between points in the UMAP embedding. Controls how tightly points are packed. Smaller values create tighter clusters.
performClustering: Automatically identify patient clusters in the reduced-dimension space, using the method selected in clusterMethod.
clusterMethod: Method for clustering patients in the reduced space.
nClusters: Number of clusters for k-means or hierarchical clustering. For DBSCAN, this is ignored.
showClusterStats: Display summary statistics for each cluster including size, characteristics, and outcome distribution.
survivalAnalysis: If survival data is available, compare survival across discovered clusters. Useful for identifying prognostic patient subtypes.
survivalTime: Time to event or censoring for survival analysis.
survivalEvent: Event indicator (1=event, 0=censored).
survivalEventLevel: Level of survivalEvent indicating that the event occurred.
scaleVars: Standardize variables to mean=0, sd=1 before analysis. Recommended when variables have different scales.
removeOutliers: Remove outliers before analysis using the IQR method. May improve visualization quality.
showLoadings: Show how original variables contribute to each dimension. Only available for PCA and MDS.
show3DPlot: Generate an interactive 3D plot using plotly (only when dimensions = "3").
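
The preprocessing arguments above can be sketched in base R. This is a minimal illustration, assuming scaleVars corresponds to scale() and removeOutliers to the conventional 1.5 × IQR fences. The multiplier, the row-wise drop rule, and the helper names iqr_keep and max_perplexity are assumptions, not the package's documented behavior. It also computes the perplexity bound enforced by Rtsne::Rtsne (3 × perplexity must not exceed n - 1), assuming the t-SNE method relies on that package, as the library(Rtsne) call in the Examples suggests.

```r
# Illustrative helpers; names and the 1.5 multiplier are assumptions.

# TRUE for values inside [Q1 - k*IQR, Q3 + k*IQR]
iqr_keep <- function(x, k = 1.5) {
  q <- unname(quantile(x, c(0.25, 0.75), na.rm = TRUE))
  fence <- k * (q[2] - q[1])
  x >= q[1] - fence & x <= q[2] + fence
}

# Largest perplexity Rtsne accepts for n rows: 3 * perplexity <= n - 1
max_perplexity <- function(n) floor((n - 1) / 3)

vars <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")

# Keep rows where every variable lies within its IQR fences
keep <- Reduce(`&`, lapply(iris[vars], iqr_keep))
clean <- iris[keep, vars]

# Standardize the remaining rows to mean 0, sd 1
X <- scale(clean)

nrow(clean)                    # rows left after outlier removal
max_perplexity(nrow(clean))    # upper bound for the perplexity argument
```

Because outlier removal shrinks the sample, a perplexity that was valid for the full dataset may exceed the bound afterwards; checking it after filtering avoids that failure.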

Value

A results object containing:

`results$instructions`             an HTML element
`results$summaryText`              preformatted text
`results$projectionPlot`           an image
`results$projection3D`             an image
`results$varianceTable`            a table
`results$loadingsTable`            a table
`results$clusterHeading`           an HTML element
`results$clusterSummary`           a table
`results$clusterCharacteristics`   a table
`results$clusterOutcomes`          a table
`results$clusterQuality`           a table
`results$survivalHeading`          an HTML element
`results$survivalTable`            a table
`results$survivalPlot`             an image
`results$survivalComparison`       a table
`results$exportClusters`           an output
`results$exportCoordinates`        an output
`results$interpretation`           an HTML element

Tables can be converted to data frames with asDF or as.data.frame. For example:

results$varianceTable$asDF

as.data.frame(results$varianceTable)

Examples

# Example 1: Basic t-SNE visualization
library(ClinicoPath)
library(Rtsne)
data(iris)

patientsimilarity(
    data = iris,
    vars = c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"),
    method = "tsne",
    colorBy = "Species"
)

# Example 2: UMAP with cluster analysis
# (clinical_data is a placeholder for your own data frame)
patientsimilarity(
    data = clinical_data,
    vars = c("age", "tumor_size", "grade", "ki67"),
    method = "umap",
    colorBy = "survival_status",
    performClustering = TRUE,
    nClusters = 3,
    showClusterStats = TRUE
)

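# Example 3: PCA with survival comparison across clusters
# (pathology_data is a placeholder for your own data frame)
patientsimilarity(
    data = pathology_data,
    vars = c("age", "stage", "nodes", "size"),
    method = "pca",
    colorBy = "death",
    dimensions = "3",
    performClustering = TRUE,
    survivalAnalysis = TRUE,
    survivalTime = "months",
    survivalEvent = "death"
)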
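
The cluster-wise survival comparison requested in the third example can be approximated outside jamovi. The sketch below is an illustration under stated assumptions, not the package's code: it uses the lung dataset from the survival package, picks three numeric variables arbitrarily, clusters the first two principal components with k-means, and compares the resulting clusters with a log-rank test.

```r
library(survival)

# Illustrative data: complete cases of three numeric variables from `lung`
vars <- c("age", "ph.karno", "wt.loss")
d <- lung[complete.cases(lung[, c(vars, "time", "status")]), ]

# Project the standardized variables to 2D with PCA
xy <- prcomp(scale(d[, vars]))$x[, 1:2]

# k-means on the 2D coordinates; nstart > 1 reduces sensitivity to the start
set.seed(42)
d$cluster <- factor(kmeans(xy, centers = 3, nstart = 25)$cluster)

# Log-rank test across clusters (in `lung`, status == 2 means death)
fit <- survdiff(Surv(time, status == 2) ~ cluster, data = d)
p_value <- 1 - pchisq(fit$chisq, df = nlevels(d$cluster) - 1)

table(d$cluster)   # cluster sizes
p_value            # log-rank p-value
```

Clusters found this way are exploratory: the grouping depends on the chosen variables, the projection, and the random start, so any survival difference should be treated as hypothesis-generating.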