🩺 RadEval: A framework for radiology text evaluation

GitHub | PyPI | Video | arXiv | RadEval_ModernBERT Model | Expert Dataset

๐ŸŽ๏ธ RadEval Evaluation

RadEval is a lightweight, extensible framework for evaluating radiology reports using both standard NLP metrics (e.g. BLEU, ROUGE, BERTScore) and radiology-specific measures (e.g. RadGraph, CheXbert, GREEN). Whether you're benchmarking generation systems or validating clinical correctness, RadEval offers comprehensive and interpretable metrics out of the box.
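
For example, a minimal run from Python might look like the sketch below. It assumes the PyPI package exposes a `RadEval` class configured with boolean `do_*` flags and called on parallel lists of reference and hypothesis reports; the exact names and signature are assumptions, so check the GitHub README before relying on them.

```python
# pip install RadEval   (package name taken from the PyPI link above)
from RadEval import RadEval  # import path assumed

# Reference (ground-truth) reports and model-generated hypotheses
refs = [
    "No acute cardiopulmonary process.",
    "Mild cardiomegaly with small bilateral pleural effusions.",
]
hyps = [
    "No acute cardiopulmonary abnormality.",
    "Cardiomegaly is mild; small pleural effusions are seen bilaterally.",
]

# do_* flag names are assumptions; enable only what you need, since the
# radiology-specific metrics load additional models.
evaluator = RadEval(
    do_bleu=True,
    do_rouge=True,
    do_bertscore=True,
    do_radgraph=True,
)

# The evaluator is assumed to return a dict of metric name -> score.
results = evaluator(refs=refs, hyps=hyps)
for metric, score in results.items():
    print(f"{metric}: {score}")
```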

⚠️ Performance Warning ⚠️

This demo is currently running on CPU. Evaluation with the slower metrics (RadGraph, CheXbert, GREEN) may take a while to complete, so please be patient.

📋 Choose Example or Custom Input
🎯 Select Evaluation Metrics

Select metrics to compute. Some metrics may take longer (RadGraph, CheXbert, GREEN).

📊 Results will appear here after evaluation...

Select your texts and metrics, then click 'Run RadEval'.

📈 Detailed Scores

📊 Available Metrics:

Traditional NLG Metrics:

  • BLEU: N-gram overlap between reference and hypothesis
  • ROUGE: Recall-oriented overlap (ROUGE-1, ROUGE-2, ROUGE-L)
  • BERTScore: Semantic similarity using BERT embeddings

Radiology-Specific Metrics (see the sketch after this list):

  • RadGraph F1: Entity and relation extraction for radiology
  • CheXbert F1: Chest X-ray finding classification performance
  • RaTEScore: Radiology-aware text evaluation score
  • RadCliQ: Composite metric for radiology reports
  • Temporal F1: Temporal entity and relationship evaluation
  • RadEval BERTScore: Specialized BERT for radiology text
  • GREEN: Generative evaluation with natural language explanations
  • SRR-BERT: Structured radiology reasoning evaluation
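
To see why these complement surface n-gram overlap, compare a hypothesis that paraphrases the reference against one that reuses its wording but flips a finding. The sketch below reuses the assumed `RadEval` class and `do_*` flags from the quick-start above; radiology-aware metrics such as RadGraph F1 and CheXbert F1 are expected to separate the two cases more sharply than BLEU.

```python
from RadEval import RadEval  # import path assumed, as in the quick-start

ref = ["Small right pleural effusion. No pneumothorax."]

# Clinically equivalent paraphrase vs. lexically similar contradiction
paraphrase = ["There is a small right pleural effusion without pneumothorax."]
contradiction = ["Small right pleural effusion. There is a pneumothorax."]

# Flag names are assumptions; see the GitHub README for the actual API
evaluator = RadEval(do_bleu=True, do_radgraph=True, do_chexbert=True)

print("paraphrase:   ", evaluator(refs=ref, hyps=paraphrase))
print("contradiction:", evaluator(refs=ref, hyps=contradiction))
```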

⚡ Performance Notes (see the sketch below for a CPU-friendly setup):

  • Fast: BLEU, ROUGE, BERTScore, Temporal F1
  • Medium: RadEval BERTScore, RaTEScore, RadCliQ, SRR-BERT
  • Slow: CheXbert F1, RadGraph F1, GREEN (requires model downloads)
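
For CPU-only runs such as this demo, one option is to start from the fast subset and opt into the slower metrics only when you need them. The configuration below is a sketch under the same assumed `do_*` flag names as the examples above.

```python
from RadEval import RadEval  # import path assumed

# Fast, CPU-friendly subset: no large model downloads required
fast_evaluator = RadEval(
    do_bleu=True,
    do_rouge=True,
    do_bertscore=True,
    do_temporal=True,
)

# Full setup: adds the medium and slow metrics, which download and run extra models
full_evaluator = RadEval(
    do_bleu=True,
    do_rouge=True,
    do_bertscore=True,
    do_temporal=True,
    do_ratescore=True,
    do_radcliq=True,
    do_srr_bert=True,
    do_chexbert=True,
    do_radgraph=True,
    do_green=True,
)
```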