Tehran University of Medical Sciences

Science Communicator Platform

Diagnostic Performance of ChatGPT in Tibial Plateau Fracture in Knee X-Ray



Mohammadi M1 ; Parviz S2 ; Parvaz P3 ; Pirmoradi MM1 ; Afzalimoghaddam M1, 4 ; Mirfazaelian H4
Author Affiliations
  1. Emergency Medicine Department, Tehran University of Medical Sciences, Tehran, Iran
  2. Musculoskeletal Imaging Research Center (MIRC), Tehran University of Medical Sciences, Tehran, Iran
  3. Radiology Department, Tehran University of Medical Sciences, Tehran, Iran
  4. Prehospital and Hospital Emergency Research Center, Tehran University of Medical Sciences, Tehran, Iran

Source: Emergency Radiology, Published: 2025


Abstract

Purpose: Tibial plateau fractures are relatively common and require accurate diagnosis. Chat Generative Pre-Trained Transformer (ChatGPT) has emerged as a tool to support medical diagnosis. This study investigates the accuracy of this tool in diagnosing tibial plateau fractures.

Methods: A secondary analysis was performed on 111 knee radiographs from emergency department patients, 29 of which had fractures confirmed by computed tomography (CT). The X-rays were reviewed by a board-certified emergency physician (EP) and a radiologist, and then analyzed by ChatGPT-4 and ChatGPT-4o. Diagnostic performances were compared using the area under the receiver operating characteristic curve (AUC). Sensitivity, specificity, and likelihood ratios were also calculated.

Results: The sensitivity and negative likelihood ratio were 58.6% (95% CI: 38.9–76.4%) and 0.4 (95% CI: 0.3–0.7) for the EP, 72.4% (95% CI: 52.7–87.2%) and 0.3 (95% CI: 0.2–0.6) for the radiologist, 27.5% (95% CI: 12.7–47.2%) and 0.7 (95% CI: 0.6–0.9) for ChatGPT-4, and 55.1% (95% CI: 35.6–73.5%) and 0.4 (95% CI: 0.3–0.7) for ChatGPT-4o. The specificity and positive likelihood ratio were 85.3% (95% CI: 75.8–92.2%) and 4.0 (95% CI: 2.1–7.3) for the EP, 76.8% (95% CI: 66.2–85.4%) and 3.1 (95% CI: 1.9–4.9) for the radiologist, 95.1% (95% CI: 87.9–98.6%) and 5.6 (95% CI: 1.8–17.3) for ChatGPT-4, and 93.9% (95% CI: 86.3–97.9%) and 9.0 (95% CI: 3.6–22.4) for ChatGPT-4o. The AUC was 0.72 (95% CI: 0.6–0.8) for the EP, 0.75 (95% CI: 0.6–0.8) for the radiologist, 0.61 (95% CI: 0.4–0.7) for ChatGPT-4, and 0.74 (95% CI: 0.6–0.8) for ChatGPT-4o. The EP and radiologist significantly outperformed ChatGPT-4 (P = 0.02 and 0.01, respectively), whereas there was no significant difference between the EP, the radiologist, and ChatGPT-4o.

Conclusion: ChatGPT-4o matched the physicians' performance and also had the highest specificity. Like the physicians, the ChatGPT chatbots were not suitable for ruling out the fracture. © The Author(s), under exclusive licence to American Society of Emergency Radiology (ASER) 2024.
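The metrics reported in the abstract all derive from a 2x2 confusion matrix. As a minimal sketch of how they relate, the snippet below recomputes sensitivity, specificity, likelihood ratios, and AUC from raw counts. The `diagnostic_metrics` function and the specific counts are assumptions, not from the paper: with 29 CT-confirmed fractures and 82 fracture-free radiographs, the EP's reported sensitivity of 58.6% implies roughly 17 true positives, and the specificity of 85.3% implies roughly 70 true negatives.

```python
# Hypothetical reconstruction of the abstract's metrics from a 2x2 table.
# Counts are inferred from the reported rates, not taken from the paper.

def diagnostic_metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Standard diagnostic-accuracy measures from raw counts."""
    sens = tp / (tp + fn)        # sensitivity: P(test+ | fracture)
    spec = tn / (tn + fp)        # specificity: P(test- | no fracture)
    lr_pos = sens / (1 - spec)   # positive likelihood ratio
    lr_neg = (1 - sens) / spec   # negative likelihood ratio
    # A single binary rater has one ROC operating point, so the AUC
    # reduces to balanced accuracy, (sensitivity + specificity) / 2.
    auc = (sens + spec) / 2
    return {"sensitivity": sens, "specificity": spec,
            "LR+": lr_pos, "LR-": lr_neg, "AUC": auc}

# Emergency physician (inferred): 17/29 fractures flagged, 70/82 normals cleared.
ep = diagnostic_metrics(tp=17, fn=12, tn=70, fp=12)
print(ep)  # sensitivity ~0.586, specificity ~0.853, LR+ ~4.0, AUC ~0.72
```

Applying the same balanced-accuracy identity to the other readers reproduces the reported AUCs (e.g. (0.724 + 0.768) / 2 ≈ 0.75 for the radiologist), which is consistent with each reader contributing a single yes/no call rather than a graded score.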