🩺 RadEval: A framework for radiology text evaluation

GitHub | PyPI | Video | arXiv | RadEval_ModernBERT Model | Expert Dataset

๐ŸŽ๏ธ RadEval Evaluation

RadEval is a lightweight, extensible framework for evaluating radiology reports using both standard NLP metrics (e.g. BLEU, ROUGE, BERTScore) and radiology-specific measures (e.g. RadGraph, CheXbert, GREEN). Whether you're benchmarking generation systems or validating clinical correctness, RadEval offers comprehensive and interpretable metrics out of the box.
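
For example, a minimal run from Python might look like the sketch below. It assumes the PyPI package exposes a `RadEval` class configured with boolean `do_*` flags and called on parallel lists of reference and hypothesis reports; the exact names and signature are assumptions, so check the GitHub README before relying on them.

```python
# pip install RadEval   (package name taken from the PyPI link above)
from RadEval import RadEval  # import path assumed

# Reference (ground-truth) reports and model-generated hypotheses
refs = [
    "No acute cardiopulmonary process.",
    "Mild cardiomegaly with small bilateral pleural effusions.",
]
hyps = [
    "No acute cardiopulmonary abnormality.",
    "Cardiomegaly is mild; small pleural effusions are seen bilaterally.",
]

# do_* flag names are assumptions; enable only what you need, since the
# radiology-specific metrics load additional models.
evaluator = RadEval(
    do_bleu=True,
    do_rouge=True,
    do_bertscore=True,
    do_radgraph=True,
)

# The evaluator is assumed to return a dict of metric name -> score.
results = evaluator(refs=refs, hyps=hyps)
for metric, score in results.items():
    print(f"{metric}: {score}")
```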

⚠️ Performance Warning ⚠️

This demo is currently running on CPU. Evaluation with the slower metrics (RadGraph, CheXbert, GREEN) may take a while to complete, so please be patient.

📋 Choose Example or Custom Input
🎯 Select Evaluation Metrics

Select metrics to compute. Some metrics may take longer (RadGraph, CheXbert, GREEN).

📊 Results will appear here after evaluation...

Select your texts and metrics, then click 'Run RadEval'.

📈 Detailed Scores

📊 Available Metrics:

Traditional NLG Metrics:

  • BLEU: N-gram overlap between reference and hypothesis
  • ROUGE: Recall-oriented overlap (ROUGE-1, ROUGE-2, ROUGE-L)
  • BERTScore: Semantic similarity using BERT embeddings

Radiology-Specific Metrics (see the sketch after this list):

  • RadGraph F1: Entity and relation extraction for radiology
  • CheXbert F1: Chest X-ray finding classification performance
  • RaTEScore: Radiology-aware text evaluation score
  • RadCliQ: Composite metric for radiology reports
  • Temporal F1: Temporal entity and relationship evaluation
  • RadEval BERTScore: Specialized BERT for radiology text
  • GREEN: Generative evaluation with natural language explanations
  • SRR-BERT: Structured radiology reasoning evaluation
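
To see why these complement surface n-gram overlap, compare a hypothesis that paraphrases the reference against one that reuses its wording but flips a finding. The sketch below reuses the assumed `RadEval` class and `do_*` flags from the quick-start above; radiology-aware metrics such as RadGraph F1 and CheXbert F1 are expected to separate the two cases more sharply than BLEU.

```python
from RadEval import RadEval  # import path assumed, as in the quick-start

ref = ["Small right pleural effusion. No pneumothorax."]

# Clinically equivalent paraphrase vs. lexically similar contradiction
paraphrase = ["There is a small right pleural effusion without pneumothorax."]
contradiction = ["Small right pleural effusion. There is a pneumothorax."]

# Flag names are assumptions; see the GitHub README for the actual API
evaluator = RadEval(do_bleu=True, do_radgraph=True, do_chexbert=True)

print("paraphrase:   ", evaluator(refs=ref, hyps=paraphrase))
print("contradiction:", evaluator(refs=ref, hyps=contradiction))
```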

⚡ Performance Notes (see the sketch below for a CPU-friendly setup):

  • Fast: BLEU, ROUGE, BERTScore, Temporal F1
  • Medium: RadEval BERTScore, RaTEScore, RadCliQ, SRR-BERT
  • Slow: CheXbert F1, RadGraph F1, GREEN (requires model downloads)
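
For CPU-only runs such as this demo, one option is to start from the fast subset and opt into the slower metrics only when you need them. The configuration below is a sketch under the same assumed `do_*` flag names as the examples above.

```python
from RadEval import RadEval  # import path assumed

# Fast, CPU-friendly subset: no large model downloads required
fast_evaluator = RadEval(
    do_bleu=True,
    do_rouge=True,
    do_bertscore=True,
    do_temporal=True,
)

# Full setup: adds the medium and slow metrics, which download and run extra models
full_evaluator = RadEval(
    do_bleu=True,
    do_rouge=True,
    do_bertscore=True,
    do_temporal=True,
    do_ratescore=True,
    do_radcliq=True,
    do_srr_bert=True,
    do_chexbert=True,
    do_radgraph=True,
    do_green=True,
)
```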