The following is a summary of “Comparative evaluation and performance of large language models on expert level critical care questions: a benchmark study,” published in the February 2025 issue of Critical Care by Workum et al.
Researchers conducted a retrospective study to explore the performance of large language models (LLMs) in administrative support and clinical decision-making within critical care medicine.
They evaluated 5 LLMs (GPT-4o, GPT-4o-mini, GPT-3.5-turbo, Mistral Large 2407, and Llama 3.1 70B) using 1,181 multiple-choice questions (MCQs) from the gotheextramile.com database, covering European Diploma in Intensive Care examination-level content. Performance was compared with random guessing and, on a separate 77-MCQ practice test, with 350 human physicians. Accuracy, consistency, domain-specific performance, and costs (as a proxy for energy consumption) were analyzed.
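For readers curious how such a benchmark is scored, the sketch below is a minimal, hypothetical illustration (not the authors' code) of how MCQ accuracy and answer consistency might be computed when each question is posed to a model several times; the data structure and function names are illustrative assumptions.

```python
from collections import Counter

# Hypothetical records: each question is asked to a model several times;
# the fields ("correct", "answers") are illustrative, not the study's format.
responses = {
    "q1": {"correct": "B", "answers": ["B", "B", "B"]},
    "q2": {"correct": "D", "answers": ["A", "A", "A"]},  # consistently wrong
    "q3": {"correct": "C", "answers": ["C", "A", "C"]},
}

def accuracy(responses):
    """Fraction of questions whose majority answer matches the answer key."""
    correct = 0
    for item in responses.values():
        majority, _ = Counter(item["answers"]).most_common(1)[0]
        correct += majority == item["correct"]
    return correct / len(responses)

def consistency(responses):
    """Fraction of questions answered identically across repeated prompts."""
    stable = sum(len(set(item["answers"])) == 1 for item in responses.values())
    return stable / len(responses)

print(f"accuracy:    {accuracy(responses):.1%}")   # 66.7% on this toy data
print(f"consistency: {consistency(responses):.1%}")  # 66.7% on this toy data
```

Note that, as in the toy record "q2" above, a model can be perfectly consistent yet still wrong, which is why accuracy and consistency are reported separately.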
The results showed that GPT-4o achieved the highest accuracy at 93.3%, followed by Mistral Large 2407 (87.9%), Llama 3.1 70B (87.5%), GPT-4o-mini (83.0%), and GPT-3.5-turbo (72.7%); random guessing yielded 41.5% (P < 0.001). On the 77-MCQ practice test, the models scored 89.0%, 84.4%, 80.9%, 80.3%, and 66.5%, respectively, compared with 42.7% for random guessing (P < 0.001) and 61.9% for human physicians; all models outperformed the physicians, although GPT-3.5-turbo did not do so significantly (P = 0.196). Although the models were highly consistent, all of them repeatedly gave incorrect answers to certain questions. GPT-4o was the most expensive, costing over 25 times more than GPT-4o-mini.
Investigators concluded that although the LLMs demonstrated impressive accuracy and consistency, with some surpassing human physicians on a practice exam, they still produced concerning inaccuracies on critical care questions, underscoring the need for rigorous and continuous evaluation before responsible clinical implementation.
Source: ccforum.biomedcentral.com/articles/10.1186/s13054-025-05302-0