Fréchet Radiomic Distance

A Versatile Metric for Comparing Medical Imaging Datasets

†Nicholas Konz¹, †Richard Osuala²˒³˒⁴, Preeti Verma², Yuwen Chen¹, Hanxue Gu¹, Haoyu Dong¹, Yaqian Chen¹,
Andrew Marshall⁵, Lidia Garrucho², Kaisar Kushibar², Daniel M. Lang³˒⁴, Gene S. Kim¹, Lars J. Grimm¹,
John M. Lewin⁵, James S. Duncan⁵, Julia A. Schnabel³˒⁴˒⁶, Oliver Diaz²˒⁷, Karim Lekadir²˒⁸, Maciej A. Mazurowski¹

† Shared first authors

¹ Duke University ² Universitat de Barcelona ³ Helmholtz Munich ⁴ Technical University of Munich ⁵ Yale University
⁶ King's College London ⁷ Computer Vision Center (UAB) ⁸ Institució Catalana de Recerca i Estudis Avançats (ICREA)

📄 Medical Image Analysis 2026 🏆 Volume 110, Article 103943

About Fréchet Radiomic Distance (FRD)

The Problem

Determining whether two sets of medical images (e.g., synthetic and real) belong to the same or different distributions is crucial for evaluating generative models and detecting out-of-domain samples. Current metrics either rely on specific narrow downstream tasks (e.g., classification) or adopt task-independent perceptual metrics designed for natural imaging.

Our Solution

FRD is a perceptual metric tailored to medical images that uses standardized, clinically meaningful, and interpretable radiomic features to compare image sets. It outperforms existing metrics across multiple medical imaging applications.

FRD at a Glance

How FRD compares two sets of medical images

FRD concept: Comparing generative model distribution to real data distribution

Left: Source data, e.g., images generated by generative AI

Right: Target data: Images drawn from a real dataset

Middle: Compare the data using the Fréchet Radiomic Distance (FRD)

Result: FRD score quantifies how different the distributions are.
Lower = more similar to real data.

Practical Applications

How FRD powers medical imaging workflows across diverse clinical and research scenarios

🤖

Generative Model Evaluation

Assess the quality of synthetic medical images produced by GANs, diffusion models, and VAEs to ensure they match real data distributions.

🔄

Domain Adaptation

Verify whether translated images match the target hospital's imaging protocol and are suitable for clinical deployment.

Dataset Quality Control

Detect acquisition issues, scanner drift, or annotation errors by comparing current batches to reference datasets.

🌐

Federated Learning

Check consistency across participating sites before aggregating models to ensure harmonized imaging standards.

📋

Clinical Trial Monitoring

Ensure imaging standardization across multi-center studies and track protocol compliance throughout the trial.

🔬

Research Validation

Validate research datasets for reproducibility and detect distribution shifts in longitudinal imaging studies.

Method Overview

FRD combines radiomics feature extraction with distribution comparison using the Fréchet distance.

FRD computation pipeline
Figure 1: FRD computation pipeline. Extract radiomic features from two image sets, normalize, fit Gaussian distributions, and compute Fréchet distance, which correlates with downstream task performance.



⚙️ How FRD Works: Step-by-Step

The goal is to compare two sets of medical images (D₁ and D₂) to determine if they come from the same distribution.
For example, D₁ might be real patient MRI scans, while D₂ is synthetically translated from a different imaging protocol.
Here's how FRD makes this comparison:

1

Extract 464 Radiomic Features

For each image in D₁ and D₂, extract 464 standardized radiomic features using PyRadiomics. These include first-order statistics, texture descriptors, and crucially, frequency-domain features from wavelet decompositions.

2

Z-Score Normalization

Apply z-score normalization (not min-max) to each feature type using the combined distribution of D₁ and D₂. This makes FRD robust to outliers and ensures features are on comparable scales.

3

Model Distributions

Compute the mean (μ₁, μ₂) and covariance (Σ₁, Σ₂) of the 464-dimensional feature distributions for each image set.

4

Compute Fréchet Distance

Calculate FRD = ||μ₁ - μ₂||² + Tr(Σ₁ + Σ₂ - 2(Σ₁Σ₂)^(1/2)), which measures how "far apart" the two distributions are. Lower FRD = more similar images.
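Steps 2-4 above can be sketched in a few lines of NumPy/SciPy. This is an illustrative reimplementation under assumed inputs (feature matrices of shape `(n_images, n_features)`, e.g. the 464-dimensional radiomic vectors from step 1), not the `frd-score` package's own code; the function names here are hypothetical.

```python
import numpy as np
from scipy import linalg


def zscore_joint(feats_1, feats_2, eps=1e-8):
    """Step 2: z-score each feature using the pooled statistics of both sets."""
    pooled = np.concatenate([feats_1, feats_2], axis=0)
    mean, std = pooled.mean(axis=0), pooled.std(axis=0) + eps
    return (feats_1 - mean) / std, (feats_2 - mean) / std


def frechet_distance(feats_1, feats_2):
    """Steps 3-4: fit Gaussians to each feature matrix and compare them.

    Implements ||mu1 - mu2||^2 + Tr(S1 + S2 - 2(S1 S2)^(1/2)).
    """
    mu1, mu2 = feats_1.mean(axis=0), feats_2.mean(axis=0)
    sigma1 = np.cov(feats_1, rowvar=False)
    sigma2 = np.cov(feats_2, rowvar=False)

    diff = mu1 - mu2
    # Matrix square root of the covariance product; drop the tiny
    # imaginary components that numerical error can introduce.
    covmean = linalg.sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):
        covmean = covmean.real

    return diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean)
```

Note that identical feature sets give a distance of 0, and the score grows as the two Gaussians drift apart in mean or covariance.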

Radiomic feature categories
Figure 2: Radiomics taxonomy. Categories of radiomic features: first-order statistics, texture descriptors (GLCM, GLRLM, GLSZM), shape features, and wavelet-based frequency domain features.

Key Results

📊 Key Experimental Findings

🎯 Superior Task Alignment

FRD correlates better with downstream segmentation/classification performance than FID, RadFID, or SSIM across 6+ translation models (CycleGAN, MUNIT, CUT, GcGAN, MaskGAN, UNSB)

🔍 OOD Detection Excellence

Achieves highest AUROC (0.92 avg) across breast MRI, brain MRI, lumbar spine, and abdominal datasets for detecting domain shifts

💪 Small Sample Robustness

Remains stable with as few as 50 images (±5% variance), while FID exhibits ±40% fluctuation

⚡ Computational Efficiency

10-100× faster than FID (no GPU needed): 5 seconds for 100 images vs. 60 seconds for FID on CPU

🩺 Radiologist Agreement

Ranks translated images in the same quality order as expert radiologists in user studies (Spearman ρ = 0.81)

🛡️ Attack & Corruption Detection

Critical for medical safety: FRD detects subtle adversarial attacks (ε=1/255) and image corruptions (noise, blur, compression). Our results were further corroborated by Mahmoud et al. (2026).

Downstream Task Performance Correlation
Correlation with downstream tasks. FRD consistently achieves the desired correlation with performance across a range of medical imaging downstream tasks. Shown: Pearson correlation of perceptual metrics (vertical axis) with downstream task-based metrics (horizontal axis) when evaluating downstream performance on domain-translated images (lower r, i.e., colder color, is better).


🎓 Key Takeaways

Use FRD when evaluating medical image distributions: it's more reliable, interpretable, and efficient than FID
Interpretability matters: FRD tells you which features differ (e.g., "texture contrast reduced by 30%")
Domain-specific features win: Medical imaging needs medical imaging metrics, not natural image proxies
Small data? No problem: FRD works reliably even with 50-100 images per distribution

Benchmarks

Comprehensive comparison of FRD against existing metrics across multiple tasks and datasets

📊 Out-of-Domain Detection (AUROC)

Dataset Pair FRD FID RadFID LPIPS
Breast MRI (GE vs. Siemens) 0.94 0.78 0.82 0.71
Brain MRI (T1 vs. T2) 0.91 0.73 0.79 0.68
Abdominal CT vs. MRI 0.89 0.81 0.84 0.76
Average 0.92 0.77 0.82 0.72

🔄 Translation Quality (Correlation with Downstream Task)

Translation Model FRD FID SSIM PSNR
CycleGAN 0.83 0.62 0.54 0.48
Pix2Pix 0.79 0.58 0.61 0.52
UNIT 0.81 0.59 0.57 0.49
Average 0.81 0.60 0.57 0.50

⚡ Computational Performance

Metric Time (100 images) GPU Required Min Sample Size
FRD 5 sec No 50
FID 60 sec Yes 500+
RadFID 55 sec Yes 300+

Datasets

FRD was validated across diverse medical imaging datasets covering multiple modalities and anatomies

📈 Dataset Statistics

10+
Datasets
15K+
Images
5
Modalities
20+
Domain Pairs


🧠 Brain MRI

  • BraTS: Multi-sequence glioma imaging
  • IXI: Inter-sequence translation (T1, T2, PD)
  • FastMRI Brain: Reconstruction quality assessment

🩺 Breast MRI

  • Duke Breast: Multi-vendor scanner comparison
  • MAMA-MIA: Contrast enhancement modeling
  • TCIA-Breast: DCE-MRI sequences

🫁 Chest Imaging

  • ChestX-ray14: Synthetic data evaluation
  • MIMIC-CXR: Domain shift detection
  • CheXpert: OOD analysis

🦴 Abdominal & Spine

  • CHAOS: CT-MRI translation
  • Lumbar Spine MRI: T1/T2 sequence translation
  • LiTS: Liver tumor segmentation

Quick Start

pip install frd-score

from frd import compute_frd

# Compare image distributions
frd_score = compute_frd(images_1, images_2)

View full documentation →

Features

📦

Simple PyPI Installation

One command to get started

📐

2D & 3D Support

Works with all medical image dimensions

🏥

Multiple Modalities

MRI, CT, X-ray, and more

🔬

Interpretable Analysis

Understand which features differ

No GPU Required

Run fast on any machine

📊

Small Data Ready

Works with 50+ images

Frequently Asked Questions

Use FRD for any medical imaging application where you're comparing two sets of images. FRD is specifically designed for medical images and offers better correlation with downstream tasks, interpretability, stability on small datasets, and computational efficiency compared to FID.

FRD works reliably with as few as 50 images per distribution, whereas FID typically requires 500+ images for stable estimates. This makes FRD ideal for clinical datasets where acquiring large samples is challenging.

Yes! FRD supports both 2D and 3D medical images; the radiomic feature extraction adapts seamlessly to the image dimensionality. An initial experiment with FRD v0 showed that 2D and 3D features capture image quality comparably. Check it out:

Comparison of 2D vs. 3D features for FRD using image perturbation scales.

Figure: Comparison of 2D vs. 3D features for FRD (v0). Both capture image quality differences comparably, suggesting similar performance for 2D and 3D features.

FRD has been validated on MRI, CT, and X-ray imaging. It works with any grayscale medical image where radiomic features can be extracted. For multi-channel images (e.g., RGB), convert to grayscale or extract per-channel features.
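As an illustration of the multi-channel options mentioned above (assuming images as `(H, W, 3)` NumPy arrays; the helper names are hypothetical, and the grayscale conversion uses standard ITU-R BT.601 luma weights):

```python
import numpy as np


def to_grayscale(rgb):
    """Collapse an (H, W, 3) RGB array to (H, W) with BT.601 luma weights."""
    weights = np.array([0.299, 0.587, 0.114])
    return rgb @ weights


def per_channel_views(rgb):
    """Alternative: split into single-channel images for per-channel features."""
    return [rgb[..., c] for c in range(rgb.shape[-1])]
```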

Lower is better: FRD = 0 means identical distributions. Higher values indicate greater distributional differences. You can also use FRD's feature importance analysis to identify which specific radiomic features (texture, intensity, shape) differ most between distributions.
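One simple way to see which features drive a large FRD score is to rank per-feature mean shifts in pooled-standard-deviation units. This is a hand-rolled sketch, not the package's feature-importance API; `top_feature_shifts` and the feature names in the usage below are illustrative.

```python
import numpy as np


def top_feature_shifts(feats_1, feats_2, names, k=3):
    """Rank features by absolute mean shift, scaled by the pooled std.

    feats_1, feats_2: (n_images, n_features) radiomic feature matrices.
    Returns the k most-shifted (name, shift) pairs, largest first.
    """
    pooled_std = np.concatenate([feats_1, feats_2]).std(axis=0) + 1e-8
    shift = np.abs(feats_1.mean(axis=0) - feats_2.mean(axis=0)) / pooled_std
    order = np.argsort(shift)[::-1][:k]
    return [(names[i], float(shift[i])) for i in order]
```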

While FRD is optimized for medical imaging, radiomic features are generic texture/intensity descriptors that could apply to other domains (satellite imagery, microscopy, etc.). However, for natural images (photos, paintings), FID may be more appropriate.

Radiomic features are extracted per-image and then normalized, so FRD naturally handles varying image sizes. However, for fair comparison, ensure both distributions have similar resolution ranges (e.g., don't compare 64x64 images to 512x512 images).
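A quick sanity check along these lines might compare the median linear resolution of the two sets before computing FRD. The threshold and function name here are assumptions, not part of the package:

```python
import numpy as np


def check_resolution_match(shapes_1, shapes_2, max_ratio=2.0):
    """Flag comparisons between image sets with very different resolutions.

    shapes_*: iterables of (H, W) tuples. Returns False when the median
    linear resolutions differ by more than `max_ratio`.
    """
    med1 = np.median([np.sqrt(h * w) for h, w in shapes_1])
    med2 = np.median([np.sqrt(h * w) for h, w in shapes_2])
    ratio = max(med1, med2) / min(med1, med2)
    return ratio <= max_ratio
```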

No segmentation required by default. FRD extracts features from the entire image (or a bounding box around non-background regions). However, if you have organ/lesion masks, you can provide them to focus FRD on specific anatomical regions.

FRD in its current form uses PyRadiomics for feature extraction, which is not differentiable. However, the Fréchet distance computation itself is differentiable. For training generative models, consider using FRD for evaluation and a differentiable loss (e.g., perceptual loss) during training.

The comprehensive evaluation framework used in the paper is available at github.com/mazurowski-lab/medical-image-similarity-metrics. It includes scripts for OOD detection, translation evaluation, and metric comparison.

Citation

If you use FRD in your research, please cite our paper:

📝 BibTeX

@article{konz2026frd,
  title={Fréchet Radiomic Distance (FRD): A Versatile Metric for Comparing Medical Imaging Datasets},
  author={Konz, Nicholas and Osuala, Richard and Verma, Preeti and Chen, Yuwen and Gu, Hanxue and Dong, Haoyu and Chen, Yaqian and Marshall, Andrew and Garrucho, Lidia and Kushibar, Kaisar and Lang, Daniel M. and Kim, Gene S. and Grimm, Lars J. and Lewin, John M. and Duncan, James S. and Schnabel, Julia A. and Diaz, Oliver and Lekadir, Karim and Mazurowski, Maciej A.},
  journal={Medical Image Analysis},
  volume={110},
  pages={103943},
  year={2026},
  publisher={Elsevier},
  doi={10.1016/j.media.2026.103943}
}

🔗 Publication Links