🩺 RadEval: A framework for radiology text evaluation
GitHub | PyPI | Video | arXiv | RadEval_ModernBERT Model | Expert Dataset
RadEval Evaluation
RadEval is a lightweight, extensible framework for evaluating radiology reports using both standard NLP metrics (e.g. BLEU, ROUGE, BERTScore) and radiology-specific measures (e.g. RadGraph, CheXbert, GREEN). Whether you're benchmarking generation systems or validating clinical correctness, RadEval offers comprehensive and interpretable metrics out of the box.
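To give a sense of how the package is typically used, the sketch below constructs an evaluator with per-metric boolean flags and calls it on parallel lists of reference and generated reports. The `RadEval` class and the `do_*` flag names are assumptions based on the project README; check the GitHub or PyPI pages linked above for the exact API.

```python
# Minimal usage sketch (assumed API: a RadEval class with per-metric do_* flags,
# callable on parallel lists of references and hypotheses).
import json
from RadEval import RadEval

refs = [
    "No acute cardiopulmonary process.",
    "Mild pulmonary edema with small bilateral pleural effusions.",
]
hyps = [
    "No acute cardiopulmonary abnormality.",
    "Pulmonary edema and bilateral effusions are present.",
]

evaluator = RadEval(do_bleu=True, do_rouge=True, do_radgraph=True)
results = evaluator(refs=refs, hyps=hyps)  # dict of metric name -> score
print(json.dumps(results, indent=2))
```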
⚠️ Performance Warning ⚠️
This demo currently runs on CPU, so the slower metrics (such as RadGraph, CheXbert, and GREEN) may take a while to finish. Please be patient.
Results will appear here after evaluation...
Select your texts and metrics, then click 'Run RadEval'.
Detailed Scores
Available Metrics:
Traditional NLG Metrics (a standalone scoring sketch appears after the full metric list):
- BLEU: N-gram overlap between reference and hypothesis
- ROUGE: Recall-oriented overlap (ROUGE-1, ROUGE-2, ROUGE-L)
- BERTScore: Semantic similarity using BERT embeddings
Radiology-Specific Metrics:
- RadGraph F1: Entity and relation extraction for radiology
- CheXbert F1: Chest X-ray finding classification performance
- RaTEScore: Radiology-aware text evaluation score
- RadCliQ: Composite metric for radiology reports
- Temporal F1: Temporal entity and relationship evaluation
- RadEval BERTScore: BERTScore computed with a radiology-adapted ModernBERT encoder (RadEval_ModernBERT)
- GREEN: Generative evaluation with natural language explanations
- SRR-BERT: Structured radiology reasoning evaluation
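To make the traditional NLG metrics concrete, here is a standalone sketch that scores a single report pair with commonly used open-source implementations (sacrebleu, rouge-score, bert-score). These are illustrative backends only; RadEval may compute the same metrics with different implementations internally.

```python
# Standalone illustration of the traditional NLG metrics listed above.
from sacrebleu import corpus_bleu
from rouge_score import rouge_scorer
from bert_score import score as bertscore

ref = "No acute cardiopulmonary process."
hyp = "No evidence of acute cardiopulmonary disease."

bleu = corpus_bleu([hyp], [[ref]]).score                       # n-gram overlap, 0-100
rouge_l = rouge_scorer.RougeScorer(["rougeL"]).score(ref, hyp)["rougeL"].fmeasure
_, _, f1 = bertscore([hyp], [ref], lang="en")                  # semantic similarity from BERT embeddings
print(f"BLEU={bleu:.1f}  ROUGE-L={rouge_l:.3f}  BERTScore-F1={f1.item():.3f}")
```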
⚡ Performance Notes:
- Fast: BLEU, ROUGE, BERTScore, Temporal F1 (a CPU-only configuration sketch follows these notes)
- Medium: RadEval BERTScore, RaTEScore, RadCliQ, SRR-BERT
- Slow: CheXbert F1, RadGraph F1, GREEN (these require model downloads)
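On CPU-only setups such as this demo, a practical starting point is to enable only the fast metrics and switch on the slower ones once a GPU is available. The sketch below reuses the same assumed `do_*` flag names as the earlier example; the actual names may differ, so consult the project README.

```python
from RadEval import RadEval

# CPU-friendly configuration: fast metrics on, slow metrics off (assumed flag names).
fast_evaluator = RadEval(
    do_bleu=True,
    do_rouge=True,
    do_bertscore=True,
    do_temporal=True,    # Temporal F1
    do_radgraph=False,   # slower: downloads and runs the RadGraph model
    do_chexbert=False,   # slower: downloads and runs the CheXbert classifier
    do_green=False,      # slowest: runs a generative evaluator model
)

refs = ["No acute cardiopulmonary process."]
hyps = ["No acute cardiopulmonary abnormality."]
scores = fast_evaluator(refs=refs, hyps=hyps)
print(scores)
```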