Tehran University of Medical Sciences

Science Communicator Platform

Evaluation of the Performance of Large Language Models in Responding to Medical Questions Related to Multiple Sclerosis: A Case Study of Large Language Models Including ChatGPT, Gemini, Grok and Copilot



Authors: Dastani M; Sajjadi MS; Yamout B; Arab Bafrani M; Nasirzadeh A

Source: PLOS ONE. Published: 2026


Abstract

Objective: To evaluate and compare the performance of four publicly available large language models (ChatGPT, Gemini, Copilot, and Grok) in answering medical questions related to multiple sclerosis, focusing on accuracy, transparency, and clinical actionability.

Methods: Four publicly available large language models (ChatGPT, Gemini, Grok, and Copilot) were selected based on accessibility and their ability to respond to medical questions. A total of 25 questions, five for each of the five key domains (diagnosis, treatment, prevention, disease control, and disease management), were developed. The responses generated by the models were evaluated using the DISCERN-AI and NLAT-AI assessment tools.

Results: The evaluation of the four AI chatbots on multiple sclerosis (MS) content revealed clear differences in quality and consistency. According to DISCERN-AI criteria, Gemini achieved the highest overall quality, excelling in relevance, transparency, balance, and acknowledgment of uncertainty. Grok ranked second, showing generally balanced results with slightly lower scores than Gemini. ChatGPT exhibited strong yet uneven performance, with particular weaknesses in content addressing vulnerable populations. Copilot demonstrated the weakest overall performance, with consistently lower scores across nearly all criteria.

Conclusions: Gemini demonstrated the strongest and most consistent performance across all domains, followed by Grok with slightly lower but balanced results. ChatGPT showed strong yet uneven outcomes, with weaknesses in addressing vulnerable populations. Copilot ranked lowest, consistently underperforming across metrics. These findings highlight significant differences among large language models in generating accurate and clinically relevant responses for multiple sclerosis, underscoring the importance of considering each model’s strengths and limitations in healthcare applications.

© 2026 Dastani et al.
This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.