Evaluation of the Performance of Large Language Models in Responding to Medical Questions Related to Multiple Sclerosis: A Case Study of Large Language Models Including ChatGPT, Gemini, Grok and Copilot

Dastani M ; Sajjadi MS ; Yamout B ; Arab Bafrani M ; Nasirzadeh A

Source: PLOS ONE. Published: 2026

Abstract

Objective: To evaluate and compare the performance of four publicly available large language models (ChatGPT, Gemini, Copilot, and Grok) in answering medical questions related to Multiple Sclerosis, focusing on accuracy, transparency, and clinical actionability.

Methods: Four publicly available large language models (ChatGPT, Gemini, Grok, and Copilot) were selected based on accessibility and their ability to respond to medical questions. A total of 25 questions, five for each of the five key domains (diagnosis, treatment, prevention, disease control, and disease management), were developed. The responses generated by the models were evaluated using the DISCERN-AI and NLAT-AI assessment tools.

Results: The evaluation of four AI chatbots (ChatGPT, Gemini, Copilot, and Grok) on multiple sclerosis (MS) content revealed clear differences in quality and consistency. According to DISCERN-AI criteria, Gemini achieved the highest overall quality, excelling in relevance, transparency, balance, and acknowledgment of uncertainty. Grok ranked second, showing generally balanced results with slightly lower scores than Gemini. ChatGPT exhibited strong yet uneven performance, with particular weaknesses in content addressing vulnerable populations. Copilot demonstrated the weakest overall performance, with consistently lower scores across nearly all criteria.

Conclusions: Gemini demonstrated the strongest and most consistent performance across all domains, followed by Grok with slightly lower but balanced results. ChatGPT showed strong yet uneven outcomes, with weaknesses in addressing vulnerable populations. Copilot ranked lowest, consistently underperforming across metrics. These findings highlight significant differences among large language models in generating accurate and clinically relevant responses for multiple sclerosis, underscoring the importance of considering each model’s strengths and limitations in healthcare applications.

© 2026 Dastani et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Related Docs

1. Methodological Insights Into ChatGPT’s Screening Performance in Systematic Reviews, BMC Medical Research Methodology (2024)

2. Patient Education in Bariatric Surgery: Can Artificial Intelligence–Based Chatbots Bridge the Knowledge Gap?, Journal of Obesity (2026)

3. Evaluation of Correctness and Reliability of GPT, Bard, and Bing Chatbots’ Responses in Basic Life Support Scenarios, Scientific Reports (2025)

Experts (# of related papers)

Hossein Ghanaati (1)

Hamideh Akbari (1)

Style	Citing Format
MLA	Dastani M, et al. "Evaluation of the Performance of Large Language Models in Responding to Medical Questions Related to Multiple Sclerosis: A Case Study of Large Language Models Including Chatgpt, Gemini, Grok and Copilot." PLOS ONE, vol. 21, no. 5 May, 2026, pp. -.
APA	Dastani M, Sajjadi MS, Yamout B, Arab Bafrani M, Nasirzadeh A (2026). Evaluation of the Performance of Large Language Models in Responding to Medical Questions Related to Multiple Sclerosis: A Case Study of Large Language Models Including Chatgpt, Gemini, Grok and Copilot. PLOS ONE, 21(5 May), -.
Chicago	Dastani M, Sajjadi MS, Yamout B, Arab Bafrani M, Nasirzadeh A. "Evaluation of the Performance of Large Language Models in Responding to Medical Questions Related to Multiple Sclerosis: A Case Study of Large Language Models Including Chatgpt, Gemini, Grok and Copilot." PLOS ONE 21, no. 5 May (2026): -.
Harvard	Dastani M et al. (2026) 'Evaluation of the Performance of Large Language Models in Responding to Medical Questions Related to Multiple Sclerosis: A Case Study of Large Language Models Including Chatgpt, Gemini, Grok and Copilot', PLOS ONE, 21(5 May), pp. -.
Vancouver	Dastani M, Sajjadi MS, Yamout B, Arab Bafrani M, Nasirzadeh A. Evaluation of the Performance of Large Language Models in Responding to Medical Questions Related to Multiple Sclerosis: A Case Study of Large Language Models Including Chatgpt, Gemini, Grok and Copilot. PLOS ONE. 2026;21(5 May):-.
BibTex	@article{ author = {Dastani M and Sajjadi MS and Yamout B and Arab Bafrani M and Nasirzadeh A}, title = {Evaluation of the Performance of Large Language Models in Responding to Medical Questions Related to Multiple Sclerosis: A Case Study of Large Language Models Including Chatgpt, Gemini, Grok and Copilot}, journal = {PLOS ONE}, volume = {21}, number = {5 May}, pages = {-}, year = {2026} }
RIS	TY - JOUR AU - Dastani M AU - Sajjadi MS AU - Yamout B AU - Arab Bafrani M AU - Nasirzadeh A TI - Evaluation of the Performance of Large Language Models in Responding to Medical Questions Related to Multiple Sclerosis: A Case Study of Large Language Models Including Chatgpt, Gemini, Grok and Copilot JO - PLOS ONE VL - 21 IS - 5 May SP - EP - PY - 2026 ER -
